<h1>Semantic Search - Word Embeddings with OpenAI</h1>
<p><em>2023-03-28 · <a href="https://codeahoy.com/2023/03/28/semantic-search-intro">codeahoy.com</a></em></p>
<p>According to Wikipedia, <a href="https://en.wikipedia.org/wiki/Semantic_search">Semantic Search</a> denotes <strong>search with meaning</strong>, as <strong>distinguished from lexical search</strong> where the search engine looks for <strong>literal matches</strong> of the query words or variants of them, without understanding the overall meaning of the query.</p>
<p>For example, suppose a user searches for the term “jaguar.” A traditional keyword-based search engine might return results about the car manufacturer, the animal, or even the Jacksonville Jaguars football team. Semantic search, however, would analyze the context and intent behind the user’s query, such as whether they are interested in cars or wildlife, and then prioritize results accordingly.</p>
<p>In this blog post, we will explore the underlying principles of semantic search, discuss its advantages over other types of search, and examine real-world applications that are transforming the way we access and consume information.</p>
<h2 id="lexical-search-engines">Lexical Search Engines</h2>
<p>Lexical (traditional) search engines have served us well using keyword-based methods: they match the exact words or phrases in a user’s query against those in the documents or database. For example, if we search for the term “<em>computer science intro</em>” in a lexical search engine, it will return results that match one or more of the search terms.</p>
<p><img src="/img/lexical-search-results.png" alt="lexical-search-results" class="center-image" /></p>
<p>As you can imagine, the keyword matching approach often falls short when it comes to understanding what the user actually meant, often producing less accurate results.</p>
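<p>A toy sketch makes the shortfall concrete. The naive matcher below ranks documents by how many query terms appear verbatim (the documents and queries are made up for illustration); it works when words overlap, but a query using only synonyms matches nothing even though the intent is identical:</p>

```python
def lexical_search(query, documents):
    """Naive keyword search: rank documents by how many query terms appear verbatim."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(doc.lower().split())), doc) for doc in documents]
    # Keep only documents with at least one overlapping term, best match first.
    return [doc for score, doc in sorted(scored, reverse=True) if score > 0]

docs = ["intro to computer science", "advanced computer graphics", "cooking basics"]

# Literal overlap works fine...
assert lexical_search("computer science intro", docs)[0] == "intro to computer science"
# ...but a synonym-only query returns nothing, despite the identical intent.
assert lexical_search("beginner cs course", docs) == []
```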
<h2 id="semantic-search">Semantic Search</h2>
<p>Enter semantic search — a context-aware search technology that aims to improve search results by focusing on understanding the meaning and context behind queries.</p>
<p>When a user inputs the query “<em>computer science intro</em>” in a semantic search engine, it would first attempt to understand the intent behind the query. In this case, the user is likely looking for introductory resources related to computer science. Based on this understanding, the search engine would prioritize search results such as introductory computer science courses or textbooks or other learning materials that cover fundamental topics in computer science.</p>
<p>NLP plays a crucial role in semantic search, as it provides the necessary tools and techniques to analyze the context and relationships between words in a query, ultimately helping the search engine understand the meaning and intent behind user queries.</p>
<p>Semantic search can also be implemented using embeddings and vector databases. In this approach, both the user query and the documents in the search database are represented as <strong>vectors</strong> in a multi-dimensional space. By comparing the distance between vectors, we can determine the most relevant results.</p>
<h3 id="word-embeddings">Word Embeddings</h3>
<p>Word embeddings are a critical component in the development of semantic search engines and natural language processing (NLP) applications. They provide a way to represent words and phrases as numerical vectors in a high-dimensional space, capturing the semantic relationships between them. The key idea behind word embeddings is that similar words or phrases, like “big mac” and “cheeseburger,” should have closely related vector representations. This proximity in the vector space reflects their semantic relationship, enabling algorithms to better understand the meaning behind words and phrases.</p>
<p>There are several popular algorithms for generating word embeddings, including Word2Vec, GloVe, <a href="https://platform.openai.com/docs/guides/embeddings">OpenAI embeddings</a> and FastText. These algorithms work by analyzing large corpora of text data, learning the context and co-occurrence patterns of words, and then generating vector representations to capture these patterns. Words that often appear in similar contexts will have similar vector representations.</p>
<p>In short, word embeddings are a powerful technique for representing words and phrases as <strong>numerical vectors</strong>. The key idea is that <strong>similar words</strong> have vectors in close proximity. Semantic search finds matching words or phrases by looking at their vector representations and finding those that are close together in that multi-dimensional space.</p>
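<p>To make “close together in vector space” concrete, here is a toy sketch. The three-dimensional vectors below are made up for illustration (real embeddings have hundreds or thousands of dimensions), but the cosine-similarity computation is the same one used in practice:</p>

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings" for illustration only.
burger = [0.9, 0.1, 0.05]
cheeseburger = [0.85, 0.15, 0.1]
car = [0.1, 0.9, 0.2]

# Semantically related terms end up with much higher similarity.
assert cosine_similarity(burger, cheeseburger) > cosine_similarity(burger, car)
```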
<p>For example, here’s how the process works in general if we use OpenAI Embeddings APIs for generating vector representations:</p>
<ol>
<li>
<p><strong>Document (or Dataset) embeddings:</strong> First, you would need to generate embeddings for the documents in your search database. You can use <a href="https://openai.com/blog/introducing-text-and-code-embeddings">OpenAI embeddings</a> to obtain these embeddings. Then you can store these vectors in a vector database or an efficient data structure like an approximate nearest neighbors (ANN) index.</p>
</li>
<li>
<p><strong>Query embeddings:</strong> When a user submits a query, you’d call the same embeddings API to <em>generate an embedding</em> for the query, using the same model as for the document embeddings. This returns a single query vector.</p>
</li>
<li>
<p><strong>Similarity search:</strong> Compare the query vector to the document vectors stored in the vector database or ANN index. You can use <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a>, Euclidean distance, or other similarity metrics to rank the documents based on their proximity (or closeness) to the query vector in the high-dimensional space. The closer the document vector is to the query vector, the more relevant it is likely to be.</p>
</li>
<li>
<p><strong>Retrieve and display results:</strong> Finally, retrieve the top-ranked documents sorted by their similarity scores and display them to the user as search results.</p>
</li>
</ol>
<p>Keep in mind that this is a simplified example, and a full-fledged semantic search engine would involve additional considerations like query expansion, context-awareness, and other techniques to improve search results.</p>
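<p>The four steps above can be sketched end to end. To keep the sketch self-contained and runnable, <code>embed()</code> below is a toy stand-in (a character-frequency vector) for a real embeddings API such as OpenAI’s; everything else mirrors the pipeline:</p>

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embeddings API: a 26-dim character-frequency vector.

    A real system would call e.g. OpenAI's embeddings endpoint here instead.
    """
    counts = Counter(text.lower())
    return [counts.get(chr(c), 0) for c in range(ord("a"), ord("z") + 1)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0  # guard against zero vectors
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

def search(query, documents, top_k=3):
    # Step 1: embed every document once and keep the vectors (the "vector database")
    index = [(doc, embed(doc)) for doc in documents]
    # Step 2: embed the query with the same method
    qvec = embed(query)
    # Step 3: rank documents by similarity to the query vector
    ranked = sorted(index, key=lambda pair: cosine(qvec, pair[1]), reverse=True)
    # Step 4: return the top-ranked results
    return [doc for doc, _ in ranked[:top_k]]
```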
<h2 id="word-embeddings-complete-example-on-github">Word Embeddings Complete Example on Github</h2>
<p>In the Python notebook linked below, we walk through the process of building a simple semantic search engine using word embeddings from OpenAI to illustrate basic concepts.</p>
<p>First let’s define our dataset of a few words that we’ll be searching against.</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="n">l</span> <span class="o">=</span> <span class="p">[</span><span class="s">"hamburger"</span><span class="p">,</span> <span class="s">"cheeseburger"</span><span class="p">,</span> <span class="s">"blue"</span><span class="p">,</span> <span class="s">"fries"</span><span class="p">,</span> <span class="s">"vancouver"</span><span class="p">,</span> <span class="s">"karachi"</span><span class="p">,</span> <span class="s">"acura"</span><span class="p">,</span> <span class="s">"car"</span><span class="p">,</span> <span class="s">"weather"</span><span class="p">,</span> <span class="s">"biryani"</span><span class="p">]</span>
<span class="n">dataset</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">'term'</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="n">dataset</span><span class="p">)</span>
</code></pre></div></div>
<p>This prints:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> term
0 hamburger
1 cheeseburger
2 blue
3 fries
4 vancouver
5 karachi
6 acura
7 car
8 weather
9 biryani
</code></pre></div></div>
<p>Next, we convert our dataset to embeddings by calling OpenAI’s embeddings API and storing the vector representations in our “database” (a dataframe):</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">openai.embeddings_utils</span> <span class="kn">import</span> <span class="n">get_embedding</span>
<span class="n">dataset</span><span class="p">[</span><span class="s">'embedding'</span><span class="p">]</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">[</span><span class="s">'term'</span><span class="p">].</span><span class="nb">apply</span><span class="p">(</span>
<span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">get_embedding</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s">'text-embedding-ada-002'</span><span class="p">)</span>
<span class="p">)</span>
<span class="c1"># print terms and their embeddings side by side
</span><span class="k">print</span><span class="p">(</span><span class="n">dataset</span><span class="p">)</span>
</code></pre></div></div>
<p>This outputs:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">term</span> <span class="n">embedding</span>
<span class="mi">0</span> <span class="n">hamburger</span> <span class="p">[</span><span class="o">-</span><span class="mf">0.01317964494228363</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.001876765862107277</span><span class="p">,</span> <span class="p">...</span>
<span class="mi">1</span> <span class="n">cheeseburger</span> <span class="p">[</span><span class="o">-</span><span class="mf">0.01824556663632393</span><span class="p">,</span> <span class="mf">0.00504859397187829</span><span class="p">,</span> <span class="mf">0.</span><span class="p">...</span>
<span class="mi">2</span> <span class="n">blue</span> <span class="p">[</span><span class="mf">0.005490605719387531</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.007445123512297869</span><span class="p">,</span> <span class="p">...</span>
<span class="mi">3</span> <span class="n">fries</span> <span class="p">[</span><span class="mf">0.01848343200981617</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.030745232477784157</span><span class="p">,</span> <span class="o">-</span><span class="p">...</span>
<span class="mi">4</span> <span class="n">vancouver</span> <span class="p">[</span><span class="o">-</span><span class="mf">0.011030120775103569</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.023991534486413002</span><span class="p">,...</span>
<span class="mi">5</span> <span class="n">karachi</span> <span class="p">[</span><span class="o">-</span><span class="mf">0.004611444193869829</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.001336810179054737</span><span class="p">,...</span>
<span class="mi">6</span> <span class="n">acura</span> <span class="p">[</span><span class="mf">0.0055086081847548485</span><span class="p">,</span> <span class="mf">0.013021569699048996</span><span class="p">,</span> <span class="p">...</span>
<span class="mi">7</span> <span class="n">car</span> <span class="p">[</span><span class="o">-</span><span class="mf">0.007495860103517771</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.021644126623868942</span><span class="p">,...</span>
<span class="mi">8</span> <span class="n">weather</span> <span class="p">[</span><span class="mf">0.011580432765185833</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.013912283815443516</span><span class="p">,</span> <span class="p">...</span>
<span class="mi">9</span> <span class="n">biryani</span> <span class="p">[</span><span class="o">-</span><span class="mf">0.009054498746991158</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.015499519184231758</span><span class="p">,...</span>
</code></pre></div></div>
<p>Now we are ready to search. We prompt the user to enter a keyword and, once it’s entered, call the API to get its vector representation:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">openai.embeddings_utils</span> <span class="kn">import</span> <span class="n">get_embedding</span>
<span class="n">keyword</span> <span class="o">=</span> <span class="nb">input</span><span class="p">(</span><span class="s">'What do you want to search today? '</span><span class="p">)</span>
<span class="n">keywordVector</span> <span class="o">=</span> <span class="n">get_embedding</span><span class="p">(</span>
<span class="n">keyword</span><span class="p">,</span> <span class="n">engine</span><span class="o">=</span><span class="s">"text-embedding-ada-002"</span>
<span class="p">)</span>
<span class="c1"># print embeddings of our keyword
</span><span class="k">print</span><span class="p">(</span><span class="n">keywordVector</span><span class="p">)</span>
</code></pre></div></div>
<p>To find matching results, we apply <em>cosine similarity</em> between the keyword’s vector and each vector in the dataset. We then print the top 3 results, sorted by similarity score (higher means closer) in descending order:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">openai.embeddings_utils</span> <span class="kn">import</span> <span class="n">cosine_similarity</span>
<span class="n">dataset</span><span class="p">[</span><span class="s">"distance"</span><span class="p">]</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">[</span><span class="s">'embedding'</span><span class="p">].</span><span class="nb">apply</span><span class="p">(</span>
<span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">cosine_similarity</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">keywordVector</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">dataset</span><span class="p">.</span><span class="n">sort_values</span><span class="p">(</span>
<span class="s">"distance"</span><span class="p">,</span>
<span class="n">ascending</span><span class="o">=</span><span class="bp">False</span>
<span class="p">).</span><span class="n">head</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
</code></pre></div></div>
<p>Here are the top 3 results for the keyword “big mac” from our dataset. As you can see, it correctly inferred that I was referring to a burger and found the right matches!</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>term distance
0 hamburger 0.853306
1 cheeseburger 0.841594
3 fries 0.823209
</code></pre></div></div>
<p>Here’s the complete code example:</p>
<h3 id="github---example-word-embeddings-and-semantic-search-using-openai"><a href="https://github.com/umermansoor/EmbeddingSemanticSearchDemo/blob/main/Embeddings101UM.ipynb">(GitHub - Example) Word Embeddings and Semantic Search using OpenAI</a></h3>
<h1>Brief Overview of Caching and Cache Invalidation</h1>
<p><em>2022-04-03 · <a href="https://codeahoy.com/2022/04/03/cache-invalidation">codeahoy.com</a></em></p>
<p>Caches are present everywhere, from the lowest levels to the highest:</p>
<ul>
<li>There are hardware caches inside your processor cores (L1, L2, L3),</li>
<li>Page/disk caches that our operating systems maintain,</li>
<li>Caches for databases, such as Memcached, Redis, or DAX for DynamoDB,</li>
<li>API caches</li>
<li>Layer-7 (Application layer) HTTP caches like Edge level caching in CDNs</li>
<li>DNS caching</li>
<li>Cache in your browser</li>
<li>Microservices can have internal caches to improve their performance for complex and time consuming operations</li>
<li>You are reading this post thanks to many intermediary caches</li>
</ul>
<p>I can keep going, but you get the point: <strong>caching is ubiquitous</strong>. Which raises the question: why do we need caching? Before you scroll down for the answer, take a few seconds to think about it.</p>
<h2 id="what-is-a-cache">What is a Cache?</h2>
<p>A cache is a fast data <strong>storage</strong> layer for storing a <strong>subset</strong> of data on a <strong>temporary</strong> basis for a duration of time. Caches are faster than original sources of data so they <strong>speed</strong> up future data retrievals by accessing the data in cache as opposed to fetching it from the actual storage location. Caches also make data retrievals <strong>efficient</strong> by avoiding complex or resource intensive operations to compute the data.</p>
<p>When the application needs data, it <em>first</em> checks if it exists in the cache. If it does, the data is read directly from the cache. If not, it is read from the primary data store or generated by services. Once fetched, the data is stored in the cache so that future requests can be served from the cache.</p>
<p><img src="https://codeahoy.com/img/read-through.png" alt="read-through" /></p>
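<p>A minimal sketch of this read path in Python. The dict-based cache and the TTL value are made up for illustration; a real deployment would use Redis, Memcached, or similar:</p>

```python
import time

cache = {}  # key -> (value, expires_at); stand-in for Redis/Memcached
TTL_SECONDS = 300

def fetch_from_database(key):
    """Stand-in for the slow primary data store."""
    return f"value-for-{key}"

def get(key):
    entry = cache.get(key)
    if entry is not None:
        value, expires_at = entry
        if time.time() < expires_at:
            return value          # cache hit: serve directly from the cache
        del cache[key]            # entry expired: treat it as a miss
    value = fetch_from_database(key)   # cache miss: go to the source of truth
    cache[key] = (value, time.time() + TTL_SECONDS)
    return value
```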
<h2 id="why-do-we-need-caching">Why do we need Caching?</h2>
<p>Typically, there are two main reasons for caching data:</p>
<ol>
<li>We cache things when the <strong>cost</strong> of generating some information is high (resource intensive) and we don’t need fresh information each time. We can calculate the information once, and then store it for a period of time and return the cached version to the users.</li>
<li>Arguably the top reason why we use caching is to <strong>speed</strong> up data retrieval. Caches are faster than original sources of data and cached information can be retrieved quickly resulting in faster responses to users.</li>
</ol>
<h3 id="caching-example">Caching Example</h3>
<p>Let’s look at an example. Suppose we have a webpage that displays “Related Content” links on the sidebar. This related content is generated by machine learning algorithms by <em>processing large volumes</em> of data in the main database, and can take <em>several seconds</em> to compute.</p>
<ul>
<li>This is a complex and resource intensive operation: each user request has to calculate this information. For popular pages on the website, a significant amount of time and resources will be spent computing the same data over and over again. <em>Impact</em>: Increased load on backend servers and databases, and higher cloud infrastructure costs.</li>
<li>Generating “Related Links” takes time and holds up the final response that’s sent to users. <em>Impact</em>: Response times increase, which hurts user experience and page performance metrics such as the Core Web Vitals that search engines use.</li>
</ul>
<p>To <em>address</em> both of these issues, we can use a cache. We compute the Related Links <strong>once</strong>, store the result in the cache, and return the cached copy for several hours or even days. The next time the data is requested, rather than performing a costly operation and waiting several seconds for it to complete, the result can be fetched from the cache and returned to users faster. (This type of <strong>caching strategy</strong> is called <a href="/2017/08/11/caching-strategies-and-how-to-choose-the-right-one/#cache-aside">Cache Aside</a>.)</p>
<figure class="figure d-block">
<p><img src="/img/blogs/cache-website.png" class="rounded mx-auto d-block" alt="Why we use caching? Because it speeds up information delivery and reduces the cost of calculating that information over and over again." /></p>
<figcaption class="figure-caption text-center">
<p>Why we use caching? Because it speeds up information delivery and reduces the cost of calculating that information over and over again.</p>
</figcaption>
</figure>
<h2 id="cache-invalidation">Cache Invalidation</h2>
<p>We have seen how useful caches can be: they save costs, scale heavy workloads, and reduce latency. But like all good things, there’s a catch, or rather trade-offs, that developers must be aware of.</p>
<p><em>Phil Karlton</em>, an accomplished engineer and Architect at Netscape, famously said the following, which also happens to be my favorite quote:</p>
<blockquote>
<p>There are only two hard things in Computer Science: cache invalidation and naming things - Phil Karlton</p>
</blockquote>
<p>Cache invalidation is the process of <strong>marking the data in the cache as invalid</strong>. When the next request arrives, the corresponding invalid data must be treated as a cache-miss, forcing it to be generated from the original source (database or service.)</p>
<p>Caches are not the <strong>source of truth</strong> for your data; that’d be your database (or a service.) The problem happens when the data in your database (the source of truth) changes, leaving <em>invalid</em> data in the cache. If the data in the cache is not invalidated, you’ll get inconsistent, conflicting or <strong>incorrect</strong> information. For example, suppose we cached the price of an item and the supplier then increases it in their system: until the cached entry is invalidated or expires, customers keep seeing the old, lower price.</p>
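<p>The write path is where invalidation has to happen: when the source of truth changes, the cached entry must be removed so the next read falls through to the database. Here is a minimal sketch continuing the price example (the dict-based stores and the key scheme are made up for illustration):</p>

```python
database = {"item:42": {"price": 10.00}}   # source of truth
cache = {"item:42": {"price": 10.00}}      # may hold a stale copy

def update_price(item_key, new_price):
    database[item_key]["price"] = new_price
    # Invalidate: without this line, readers keep seeing the old price.
    cache.pop(item_key, None)

def get_item(item_key):
    if item_key in cache:
        return cache[item_key]             # cache hit
    value = database[item_key]             # miss: read the source of truth
    cache[item_key] = value                # repopulate the cache
    return value
```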
<p>Cache invalidation is indeed a hard problem. <em>Why?</em> Because we effectively need to track the <em>dependency graph</em> of all the inputs that produced the result we cached. Any time even a single input changes, the cached result becomes stale or invalid. Miss just one subtle place, and we have an issue. Worse, the program will still mostly work, making it very difficult to track down the exact issue and fix the invalidation logic. A cache that doesn’t work at all is usually one of the simpler things to find and fix; invalidation bugs that leave the program “mostly working” make otherwise trivial bugs fiendishly hard to discover.</p>
<p>Let’s revisit our earlier example of caching “Related Content” links (links to other related pages for a webpage.) Suppose one of the linked pages is no longer present in the system: it was taken down by an admin because of a complaint. We forgot to capture this input for cache invalidation. Now we have a “mostly working” system in which users get an HTTP 404 error when they click the broken link. Debugging is very difficult because the page <em>hosting</em> the broken link is not broken in any way; we only see HTTP 404 errors in the logs, and troubleshooting turns into a nightmare.</p>
<p>In distributed systems with several inter-connected caches, invalidation becomes even more difficult thanks to many dependencies, race conditions, and the need to invalidate every cache that holds the data. <a href="/learn/tutorials/distributed-caching-at-scale/">Distributed caching</a> has its own challenges at scale, and some complex systems like Facebook’s Tao use <strong>cache leaders</strong> for handling invalidations for all data under their shards.</p>
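<p>A common pattern, greatly simplified here, is to broadcast invalidation messages so that every cache node holding a key drops it. The class and method names below are made up for illustration, and the bus stands in for a real pub/sub channel such as Redis pub/sub or Kafka:</p>

```python
class CacheNode:
    """One cache instance among many in a distributed deployment."""
    def __init__(self, name):
        self.name = name
        self.store = {}

    def invalidate(self, key):
        self.store.pop(key, None)

class InvalidationBus:
    """Toy stand-in for a pub/sub channel that fans out invalidations."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, node):
        self.subscribers.append(node)

    def publish_invalidation(self, key):
        # In a real system this delivery is asynchronous, which is exactly
        # where races between writes and invalidations creep in.
        for node in self.subscribers:
            node.invalidate(key)
```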
<p>Heck, it is easy to run into cache issues during the course of normal software development. Modern CPUs have several cores, and each has its own cache (L1) that’s periodically synced with main memory (RAM). In the absence of proper synchronization, values written by one thread may not be visible to other threads. For example:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">foo</span> <span class="o">=</span> <span class="mi">2</span><span class="o">;</span>
</code></pre></div></div>
<p>In Java, the JVM might update the value of <code class="language-plaintext highlighter-rouge">foo</code> in the core-local cache and not commit the result to main memory. A thread running on another core may then see a stale value for <code class="language-plaintext highlighter-rouge">foo</code>. (This is one of the primary reasons why writing multithreaded applications is hard.)</p>
<p>In summary, caching is a super useful technique. But it can easily go wrong if we are not careful. When using a cache, it’s important to understand how and when to invalidate it and to build proper invalidation processes.</p>
<h2 id="when-to-not-use-a-cache">When to Not Use a Cache</h2>
<p>Caches are <strong>not</strong> always the right choice. They may not add any value and, in some cases, may actually degrade performance. Here are some conditions to check before adding a cache; the more of them that hold, the stronger the case for caching:</p>
<ol>
<li>The original source of data is slow (e.g. a query that does complex JOINs in a relational database.)</li>
<li>The data doesn’t need to change for each request (e.g. caching real-time sensor data that your car needs when it’s in the self-driving mode or live medical data from patients… not good ideas.)</li>
<li>The operation to fetch the data must not have any <strong>side-effects</strong> (e.g. a Relational DB Transaction that fetches data and updated KPI counters is not a good caching candidate due to side-effect of updating counters.)</li>
<li>The data is frequently accessed and needed more than once.</li>
<li>The cache hit:miss ratio is good relative to the total cost of cache misses. For example, suppose we put a cache in front of user requests and it takes <strong>10 ms</strong> to check whether the data <em>exists</em> in the cache, versus the original fetch time of <strong>60 ms</strong>. If only <strong>5%</strong> of requests hit the cache, we are <strong>adding</strong> 10 ms to the 95% of requests that miss. Doing rough calculations, we can see that the cache is actually hurting performance:</li>
</ol>
<ul>
<li>Before cache: <code class="language-plaintext highlighter-rouge">1,000,000 requests * 60 milliseconds per request = 60,000,000 milliseconds total</code></li>
<li>After cache: <code class="language-plaintext highlighter-rouge">(0.05 * 1,000,000 * 10) + (0.95 * 1,000,000 * (60 + 10)) = 67,000,000 milliseconds total</code>. Each cache miss costs 60 + 10 = 70 milliseconds. That’s worse than using no cache, assuming all requests are equal in value and distribution.</li>
</ul>
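<p>The break-even point can be computed directly. With a 10 ms cache lookup and a 60 ms origin call, the average latency for hit rate <code>h</code> is <code>h * 10 + (1 - h) * 70</code>, so the cache only pays off once the hit rate exceeds about 16.7%:</p>

```python
def avg_latency_ms(hit_rate, lookup_ms=10, origin_ms=60):
    """Average request latency with a cache in front of a slow origin.

    A hit costs only the lookup; a miss pays the lookup *and* the origin call.
    """
    miss_ms = lookup_ms + origin_ms
    return hit_rate * lookup_ms + (1 - hit_rate) * miss_ms

# 5% hit rate: 67 ms on average, worse than the 60 ms no-cache baseline.
assert abs(avg_latency_ms(0.05) - 67) < 1e-9
# Break-even: 70 - 60*h = 60  =>  h = 1/6, roughly a 16.7% hit rate.
assert abs(avg_latency_ms(1 / 6) - 60) < 1e-9
```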
<h2 id="caching-strategies">Caching Strategies</h2>
<p>There are <strong>many</strong> different ways to configure and access caches. Various <a href="/2017/08/11/caching-strategies-and-how-to-choose-the-right-one/">cache strategies are covered in this post</a>.</p>
<h1>How To Manage Employees Who Are Going Through a Difficult Period</h1>
<p><em>2022-01-02 · <a href="https://codeahoy.com/2022/01/02/managing-employee-with-personal-issues">codeahoy.com</a></em></p>
<p>Managing a team member who’s going through a difficult period is a challenging scenario for any manager, and it is not as uncommon as one might think. Major life events like the death of a family member, prolonged illness of a child or spouse, or separation or divorce are very stressful situations for almost everyone. Some people do a better job than others of keeping their work and personal lives separate, but it is not uncommon for these major life events to creep in and start affecting the work performance of even top performers.</p>
<p>So what can managers do when one of their reports is going through a difficult period and it is affecting their ability to do their work? Two things are key here:</p>
<ol>
<li>You have to show compassion and help the employee,</li>
<li>You are also responsible for your team’s productivity and producing results for your company.</li>
</ol>
<p>Depending on the situation, it can feel like walking a tightrope. Managers must navigate carefully and come up with a plan. Here are some things managers can do in these situations.</p>
<h2 id="1-dont-wait---early-detection-goes-a-long-way">1. Don’t Wait - Early Detection Goes a Long Way</h2>
<p>As a manager, you shouldn’t expect your employees to come to you and share that they are going through a stressful period. If they do, great. But good managers must detect early warning signs and start the conversation early. If an employee consistently starts showing up late to meetings, misses deadlines, or their quality of work goes down, sit down and talk to them.</p>
<p>If you wait too long and don’t take action, the entire team might suffer, or the situation could get so bad that the employee might simply resign, without ever giving you the opportunity to explore a solution together.</p>
<h2 id="2-understand-the-situation">2. Understand the Situation</h2>
<p>When you talk and he or she opens up and shares that they are going through difficult times, the best thing you can do is listen intently and understand the situation as much as possible. Show compassion and care, but avoid being a therapist. Understand whether it’s a short-term challenge or something that could potentially drag on for many months.</p>
<h2 id="3-dont-solve-solo-brainstorm-with-your-employee">3. Don’t Solve Solo. Brainstorm With Your Employee</h2>
<p>Many managers, when faced with this scenario, are tempted to offer <strong>time off</strong> or <strong>flexible hours</strong> to their employees. While these are good solutions in many situations, they aren’t always. Instead, <strong>brainstorm</strong> solutions with your employee. Start the conversation by expressing your support and genuine sympathy, asking what the two of you and the company can do to support them during this difficult period, and how to best manage their responsibilities.</p>
<p>Maybe the employee is working on a challenging project that requires a lot of collaboration and meetings, and those are just not feasible at the moment. Maybe the on-call responsibilities are too much for them for the next few weeks. The point is, there could be many ways to support your employee during a personal crisis. It could even be a combination of different things: a flexible schedule plus independent tasks, as an example.</p>
<h2 id="4-make-a-plan">4. Make a Plan</h2>
<p>After brainstorming, make a plan that works for the employee and helps them get through their difficult situation while also meeting modified work expectations. Set realistic goals. Also make sure that the plan is in line with your company’s HR policies. If you are not sure, review your company policies or, if you have a good HRBP, talk to them.</p>
<h2 id="5-share-the-arrangement-with-the-team">5. Share the Arrangement With the Team</h2>
<p>The team usually knows about these situations, because chances are they are a part of the solution in some way or another. If they don’t, you should let them know about any adjustments you are making. The team will notice, and if they don’t have the context, the team culture might suffer.</p>
<p>I must add that you should respect the employee’s privacy at all times. If they don’t want to share details with the team, then don’t share any details and keep it high level.</p>
<h2 id="6-follow-up">6. Follow up</h2>
<p>Once you have a plan, follow up with your employee from time to time in your 1-1’s, or even over Slack or email. Don’t pry; instead, show care and ask how they are doing, how the plan is working, and whether any adjustments need to be made.</p>
<h2 id="summary">Summary</h2>
<p>An employee going through a personal crisis or difficult period can have a serious impact on their and the team’s productivity. Managers should be on the lookout for any warning signs. Once the issue is detected, proceed with careful planning and compassion towards the employee whose personal problems are affecting their work performance.</p>
<h1>Code Reviews During Emergencies</h1>
<p><em>2022-01-01 · <a href="https://codeahoy.com/2022/01/01/emergency-code-reviews">codeahoy.com</a></em></p>
<p>It’s 3:40pm on a fine Friday afternoon. You are about to wrap up the main logic for a feature you’ve been working on for a couple of days when you notice you have unread Slack notifications. One catches your eye in particular:</p>
<blockquote>
<p>“Hey! Emergency. Need approval on my PR. I have tested everything on dev so if you can quickly approve, I can merge it before the weekend.”</p>
</blockquote>
<p>Emergency. Alright, they have your attention. You could stop what you are doing and do a <em>superficial</em> code review on their <em>pull request</em> (<strong>PR</strong>) because after all, it’s an <em>emergency</em>. But you are curious and fire back seeking to understand the context:</p>
<blockquote>
<p>“Sure, I can take a look right now. What’s the emergency? Is it a bug in production or does the legal team want something updated ASAP?”</p>
</blockquote>
<p>They reply back right away:</p>
<blockquote>
<p>“No, I want to merge this code that does <em>&lt;insert something innocuous&gt;</em> because I’m OOO on Monday and want to get this released today”</p>
</blockquote>
<p>Urgent? Perhaps. But does this constitute an emergency? Definitely not.</p>
<p>In this post, we’ll discuss how to review code during emergencies. Let’s first begin by establishing an understanding of what constitutes a <em>real emergency</em>.</p>
<p><img src="/img/blogs/michael-scott-meme-emergency.png" class="rounded mx-auto d-block" alt="Michael Scott Meme 911 Emergency" /></p>
<h2 id="emergencies">Emergencies</h2>
<p>An emergency is a critical <strong>bug</strong> in production that’s affecting users, a major security issue, an urgent legal concern or something that’s blocking a major feature release that has a significant impact on KPIs.</p>
<p>I think it’s easier to understand if we look at a few examples of what is <strong>not</strong> an emergency:</p>
<h3 id="not-emergencies">Not Emergencies</h3>
<p>The following examples are not emergencies.</p>
<ul>
<li>The author has been working very hard and long hours on the feature and wants to get it out ASAP.</li>
<li>It’s 30-minutes to the weekend code-freeze deadline so it’ll be nice to get it merged.</li>
<li>The author’s manager told them that it’d be great if they can release the feature this week.</li>
<li>The author will be away for the next couple of days.</li>
</ul>
<p>These may be urgent situations or special circumstances, and they can certainly be important. Delaying a release may not be ideal, but it’s not usually <em>catastrophic</em>, hence not an emergency. (The only <em>exceptions</em> are feature releases where something bad would happen if they were delayed.)</p>
<h2 id="code-review-protocol-during-emergencies">Code Review Protocol During Emergencies</h2>
<p>You might ask yourself: if it’s so important to fix the bug as soon as possible during an emergency, should we <em>even bother with code reviews</em>? The answer is yes, you should definitely review <strong>all</strong> changes, including during emergencies. However, during emergencies, we should <strong>adapt</strong> our code review process and make some changes to it.</p>
<p>Emergencies require special attention and a modified code review process. For example, during normal <a href="/2016/04/03/effective-code-reviews/">code reviews</a>, you might look at various aspects such as complexity, style, code documentation, unit tests, etc. and provide suggestions or nitpick. However, that won’t work during emergencies, where the goal is to complete the code review process as <strong>fast</strong> as possible.</p>
<p>Here are four things code review authors and reviewers should do differently during emergencies:</p>
<h3 id="1-pr-authors-keep-the-change-small-and-focused">1. PR Authors: Keep the Change Small and Focused</h3>
<p>The authors of the PR should make sure that it is a small change going out to specifically address the emergency only. Do not add anything extra that could confuse reviewers and introduce potential delays. Keep your change small and focused.</p>
<h3 id="2-reviewers-complete-review-as-fast-as-possible">2. Reviewers: Complete Review as Fast as Possible</h3>
<p>While normal code reviews can take anywhere from a few hours to a few days to complete, during emergencies it is critical that code reviews are completed as fast as possible. Speed is the main priority and takes precedence over everything else. Just like the author of the PR dropped everything to get the fix out, the reviewers should do the same and make code review their top priority.</p>
<h3 id="3-reviewers-limit-code-review-focus">3. Reviewers: Limit Code Review Focus</h3>
<p>During the normal code review process, a reviewer may look at various aspects of the code such as its design, complexity, and <a href="/2016/05/22/effective-coding-standards/">code style</a>, to name a few. However, during emergencies, reviewers should restrict their focus to one thing and one thing only: “<em>does this address and resolve the emergency?</em>” In other words, the main criterion is: if we merge this code into production, will it make the bug go away? Refrain from providing suggestions or opinions, as they’d have no impact on the PR’s ability to address the issue. I’ll repeat myself here: <strong>speedy code review is your top priority here</strong>.</p>
<h3 id="4-everyone-follow-up-with-more-through-code-review">4. Everyone: Follow up With a More Thorough Code Review</h3>
<p>Once the change has been released to production, the emergency has been resolved, and you have all taken a deep breath and relaxed, it’s time to <strong>revisit</strong> the change and do a regular, more <strong>thorough</strong> code review. Do not skip this step, especially if reviewers noticed things during the ‘emergency code review’ that they didn’t call out due to urgency at the time. For example, you may have noticed that the code was poorly structured and too complex to be understood and maintained in the future. Call it out now so the author can follow up with another PR to address these issues.</p>
<h2 id="summary">Summary</h2>
<p>In this post, we looked at how to do code reviews during emergencies such as handling critical bugs in production. <strong>In emergency situations, we should not skip the code review process</strong>, but rather <em>adapt</em> it to focus on the <em>speed</em> of completing the review, keeping the code review focused on the code’s ability to address the issue and then following up with a more formal code review process.</p>
<p class="message">
I would love to hear your feedback, comments, and thoughts. Please leave a
comment below sharing your experience or anything that would add value to this article and its future readers.
</p>
Burnout in Software Development - Survey Results 20212021-10-01T00:00:00+00:00https://codeahoy.com/2021/10/01/software-developer-burn-out-survey<p>Burnout is very common in software development. Intense mental focus, <a href="https://ea-spouse.livejournal.com/274.html" rel="nofollow">heavy workloads</a>, never ending roadmaps, under-staffed teams, unclear targets, and many other factors lead to developer burnout.</p>
<p>I have experienced burnout in my career, perhaps twice. Early on, I had mild symptoms and my manager recognized it and helped me manage it. Later, after a very long stretch of working very long hours, 7 days a week on some super unrealistic goals with very little support, I ended up getting a bad case of burnout. It took me weeks to recover from it.</p>
<p>When the COVID-19-driven shift to remote work began, it was a step in the right direction, giving developers the flexibility to work from anywhere. However, I started hearing more and more about burnout. In this article, I will share the results of a developer survey we ran to explore whether COVID-19 is somehow making the burnout situation worse.</p>
<h2 id="what-is-burnout">What is burnout?</h2>
<blockquote>
<p>Burnout <a href="https://www.webmd.com/mental-health/burnout-symptoms-signs" rel="nofollow">is</a> a form of <strong>exhaustion</strong> caused by constantly feeling swamped. It’s a result of excessive and prolonged emotional, physical, and mental <strong>stress</strong>. In many cases, burnout is related to one’s job.</p>
<p>Burnout happens when you’re overwhelmed, emotionally drained, and unable to keep up with life’s incessant demands.</p>
</blockquote>
<p>If it sounds very <em>broad</em>, it actually is. There is no test a doctor can run to be certain whether you are suffering from burnout or not. Regardless, if it’s not identified and addressed, it will continue to grow until it grinds you down and becomes the new normal. When that happens, it is not easy to pull yourself back out. It might take many months or even years to shake it off.</p>
<p>Here are some of the common symptoms of burnout:</p>
<ul>
<li>You stop enjoying working on things you used to enjoy and lose motivation for weeks.</li>
<li>You feel exhausted or fatigued all or most of the time. You have low energy at work and at home. You struggle to sleep and wake up feeling tired.</li>
<li>You don’t feel a sense of accomplishment or feel sort of hopeless about results.</li>
</ul>
<p>(The symptoms are pretty similar to <a href="https://en.wikipedia.org/wiki/Major_depressive_disorder" rel="nofollow">Depression</a>. Unlike burnout, depression is a clinical condition which requires immediate medical attention.)</p>
<h2 id="what-causes-burnout">What causes burnout?</h2>
<p>Here are some of the common <a href="https://www.mayoclinic.org/healthy-lifestyle/adult-health/in-depth/burnout/art-20046642" rel="nofollow">causes</a> of burnout:</p>
<ul>
<li>Lack of control. An inability to influence decisions that affect your job — such as your schedule, assignments or workload — could lead to job burnout. So could a lack of the resources you need to do your work.</li>
<li>Unclear job expectations. If you’re unclear about the degree of authority you have or what your supervisor or others expect from you, you’re not likely to feel comfortable at work.</li>
<li>Dysfunctional workplace dynamics. Perhaps you work with an office bully, or you feel undermined by colleagues or your boss micromanages your work. This can contribute to job stress.</li>
<li>Extremes of activity. When a job is monotonous or chaotic, you need constant energy to remain focused — which can lead to fatigue and job burnout.</li>
<li>Lack of social support. If you feel isolated at work and in your personal life, you might feel more stressed.</li>
<li>Work-life imbalance. If your work takes up so much of your time and effort that you don’t have the energy to spend time with your family and friends, you might burn out quickly.</li>
</ul>
<p>How to deal with burnout? Here’s a good list of <a href="https://alphacolin.com/burnout-is-caused-by-resentment/" rel="nofollow">must-haves</a> to prevent burnout:</p>
<ul>
<li>Exercising 2 - 3 times a week</li>
<li>Eating well (fruits and veggies)</li>
<li>8 hours of sleep at least 4 nights a week</li>
<li>Unplugging totally for at least a week a year</li>
</ul>
<p>I’ll add the following as well:</p>
<ul>
<li>Set aside some time (at least 30 minutes) to go for a walk every day. Don’t check your Slack or emails when walking.</li>
<li>Don’t work from your bedroom. If you can, move your desk to a different room or another area. At the end of the day, walk away from your desk marking the end of the work day. Maintain proper <a href="https://codeahoy.com/2019/10/19/do-software-developers-work-weekends-work-life-tech/">work-life balance</a>.</li>
<li>I strongly second getting at least 8 hours of sleep more than once a week.</li>
<li>Spend time with your friends and family.</li>
<li>Meditate.</li>
</ul>
<h2 id="developer-survey-results">Developer Survey Results</h2>
<p>Let’s look at the results of the software developers survey to understand the scope and impact of burnout given COVID-19 related changes in workplaces. Here are the survey stats:</p>
<ul>
<li>A total of <em>504</em> developers participated in the survey.</li>
<li>71% of the participants who took the survey were from the US, while the remaining 29% were from the rest of the world.</li>
<li>87% of the participants were individual contributors with titles such as Software/Lead/Staff/Principal Engineer or DevOps. The rest were in (engineering) or product management.</li>
<li>62% worked in companies with more than 500 employees.</li>
</ul>
<h2 id="82-of-all-developers-indicated-that-they-have-experienced-burnout-in-last-6-to-8-months">82% of all developers indicated that they have experienced burnout in the last 6 to 8 months</h2>
<p>The first question on the list asked software developers if they are experiencing or have experienced burnout in the last 6 to 8 months. A whopping 82% said yes. Almost 50% of developers cited feelings of burnout to a “great” or “moderate” extent.</p>
<p><img src="/assets/images/burnout/burnout.001.jpeg" alt="Software developers burnout survey chart and stats - 1" /></p>
<hr />
<h2 id="73-developers-said-burnout-is-negatively-impacting-their-productivity-or-personal-life">73% of developers said burnout is negatively impacting their productivity or personal life</h2>
<p>This was not a surprise. Lack of energy and motivation directly leads to lower productivity at work. Burnout also affects personal life and relationships.</p>
<p><img src="/assets/images/burnout/burnout.002.jpeg" alt="Software developers burnout survey chart and stats - 2" /></p>
<hr />
<h2 id="developers-indicated-increased-workload-and-and-poor-work-culture-as-the-main-reason">Developers indicated increased workload and poor work culture as the main reasons</h2>
<p>This was a multiple choice question. There were many options to choose from and participants could even come up with their own answer. I have summarized work-related reasons as “Poor work culture” and pandemic related as “COVID-19”.</p>
<p><img src="/assets/images/burnout/burnout.003.jpeg" alt="Software developers burnout survey chart and stats - 3" /></p>
<hr />
<h2 id="57-developers-indicated-that-covid-19-has-made-the-situation-worse">57% of developers indicated that COVID-19 has made the situation worse</h2>
<p><img src="/assets/images/burnout/burnout.004.jpeg" alt="Software developers burnout survey chart and stats - 4" /></p>
<p>Participants were also given an option to provide more details by leaving comments. The top theme, cited as the silver lining, was not having to commute to the workplace every day.</p>
<hr />
<h2 id="77-of-the-developers-indicated-that-their-management-is-not-aware-of-the-burnout-or-not-taking-any-steps-to-help-its-employees-manage-it">77% of the developers indicated that their management is not aware of the burnout or not taking any steps to help its employees manage it</h2>
<p>Only 23% said that their company has formally recognized burnout and other challenges and is taking concrete steps, such as additional days off and reasonable workloads, to ensure employee well-being. Kudos to the companies who are doing this!</p>
<p><img src="/assets/images/burnout/burnout.005.jpeg" alt="Software developers burnout survey chart and stats - 5" /></p>
<hr />
<h2 id="78-of-the-developers-said-they-are-planning-to-switch-their-job-within-the-next-12-months">78% of the developers said they are planning to switch their job within the next 12 months</h2>
<p>I was expecting this number to be high, but not this high. Only 22% cited burnout as the reason behind their desire to switch jobs. I wish I had comments enabled for this question to better understand why such a large percentage of developers are considering a change.</p>
<p><img src="/assets/images/burnout/burnout.006.jpeg" alt="Software developers burnout survey chart and stats - 6" /></p>
<h2 id="other-comments">Other comments</h2>
<p>Developers surveyed were asked to leave their comments. I can’t post them all, but here are a few of them.</p>
<blockquote>
<ul>
<li>Lots of engineers left. The workload has increased significantly and management can’t hire enough people. Working longer hours and wait for the weekends to recharge. But weekends are mostly groceries, chores, helping kids with their homework.. the cycle continues</li>
<li>Inability to disconnect from work, expectations of being always reachable and no thought for planning for alternate coverage.</li>
<li>I work as a QA for 3 years and I’ve recently got hired after being fired by low productivity. My understanding is the overload of what clients consider on our job. It should be related to software Quality Assurance, but because I only dealt with legacy systems (bank and flight), QAs don’t even know the expected behavior, not even the users themselves. I’m planning on switching jobs.</li>
<li>I’m still fairly new to my position, and it takes time to get used to a new codebase, company, and processes, which makes my job more stressful and creates pressure to get up to speed.</li>
</ul>
</blockquote>
<p>There you have it. Thanks to everyone who participated in the survey and provided their feedback. Leave your comments in the comments section below.</p>
<h3 id="if-you-enjoyed-this-post-please-sign-up-for-unlaunch-which-lets-you-release-your-code-whenever-you-want-and-change-its-behavior-on-the-fly-using-feature-flags">If you enjoyed this post, please sign up for <a href="https://www.unlaunch.io/">Unlaunch</a>, which lets you release your code whenever you want and change its behavior on the fly using Feature Flags.</h3>
How to use Feature Flags in Node.js2021-09-12T00:00:00+00:00https://codeahoy.com/2021/09/12/how-to-use-feature-flags-in-node-js<p>Feature flags (or feature toggles) are a powerful technique used by modern software teams to control the behavior of their code and features in production. Feature flags are used for:</p>
<ul>
<li>rolling new features out gradually</li>
<li><a href="https://martinfowler.com/bliki/BranchByAbstraction.html">branch by abstraction</a></li>
<li>experimentation and testing</li>
<li>migrations</li>
<li>geographic restrictions</li>
<li>permissions or kill switch</li>
<li>testing in production</li>
</ul>
<p>In this article, I’ll show you how to use feature flags in Node.js. For this example, let’s assume we have an e-commerce store and we’re implementing a <strong>new feature</strong> to change the sort order of the items because, in our hypothetical example, the marketing team thinks changing the sort order will lead to more revenue. I’ll be using an open-source Node project called <a href="https://github.com/Mahrukhizhar/crate">Crate</a>, which is an eCommerce subscription service for trendy clothes and accessories.</p>
<h2 id="1-put-your-feature-behind-feature-flag-in-code">1. Put your feature behind feature flag in code</h2>
<p>First, you’ll need to define the new feature in such a way it can be shown or hidden easily.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if featureFlag == "on" then
// show the new feature: new sort order
else
// show the old feature: old sort order
</code></pre></div></div>
<p>Once you have wrapped your code (new and old features) in a feature flag, you can easily control the behavior by changing the state of the feature flag or its targeting rules. If the flag is enabled (the “on” variation), we show the new feature. Otherwise, we don’t.</p>
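<p>Concretely, in JavaScript the wrapper from step 1 might look like the following sketch. The flag lookup function here is a hypothetical stand-in for illustration; in the real app the variation comes from the Unlaunch SDK:</p>

```javascript
// Hypothetical stand-in for illustration; the real variation comes
// from the Unlaunch SDK, not this hard-coded function.
function getFeatureFlagVariation(flagKey) {
  return "on"; // e.g. looked up from a flag service or a config store
}

function getProducts(products) {
  if (getFeatureFlagVariation("default-sort-order") === "on") {
    // new feature: new sort order (ascending by id)
    return [...products].sort((a, b) => a.id - b.id);
  }
  // old feature: old sort order (descending by id)
  return [...products].sort((a, b) => b.id - a.id);
}
```

<p>Because both code paths stay in the codebase, flipping the flag is all it takes to switch behavior; no redeploy is needed.</p>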
<h2 id="2-create-a-feature-flag">2. Create a feature flag</h2>
<p>Next, create a new feature flag in <a href="https://app.unlaunch.io">Unlaunch</a> that will control the feature you’re building. You can call it the “default-sort-order” or “crate-products”.</p>
<h3 id="targeting-users">Targeting Users</h3>
<p>A user is any object, such as an email address, a unique user ID, or a hash, that represents a unique user for which a feature flag is evaluated. Let’s define an internal user (an engineering, QA, or product team member) by email address to always get the “on” variation, so they always see the new feature.</p>
<h3 id="targeting-rules">Targeting Rules</h3>
<p>The rules that define which variation is evaluated are known as flag targeting rules. Here you can target by user attributes such as new users vs old, or by geo-location. Default Rule is the “catch-all” rule when targeting rules don’t match. In this case, the goal is to enable the feature for 10% of the users. To do that, we’ll define percentages under Default Rule.</p>
<p><img src="/img/feature-flags/nodejs-feature-flag.png" alt="Feature flag setup" /></p>
<h2 id="3-integrate-unlaunch-nodejs-sdk-in-your-app">3. Integrate Unlaunch Node.js SDK in your app</h2>
<p>You’ll need to use the Unlaunch Node.js SDK to evaluate the feature flag you just defined from your app. If you’re using NPM, you can easily install the SDK:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>npm install --save unlaunch-node-sdk
</code></pre></div></div>
<h2 id="4-call-the-feature-flag-and-use-variation-to-show-or-hide-the-feature">4. Call the feature flag and use variation to show or hide the feature</h2>
<p>Open the file where you wrapped the new feature in step #1. Let’s call the feature flag we just defined and use its result instead.</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">const</span> <span class="nx">UnlaunchFactory</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">"</span><span class="s2">unlaunch-node-sdk</span><span class="dl">"</span><span class="p">)</span>
<span class="kd">const</span> <span class="nx">variation</span> <span class="o">=</span> <span class="nx">unlaunch</span><span class="p">.</span><span class="nx">client</span><span class="p">.</span><span class="nx">variation</span><span class="p">(</span><span class="dl">"</span><span class="s2">crate-products</span><span class="dl">"</span><span class="p">,</span> <span class="nx">auth</span><span class="p">.</span><span class="nx">user</span><span class="p">.</span><span class="nx">email</span><span class="p">);</span>
<span class="k">if</span><span class="p">(</span><span class="nx">variation</span> <span class="o">==</span> <span class="dl">'</span><span class="s1">on</span><span class="dl">'</span><span class="p">){</span>
<span class="k">return</span> <span class="k">await</span> <span class="nx">models</span><span class="p">.</span><span class="nx">Product</span><span class="p">.</span><span class="nx">findAll</span><span class="p">({</span> <span class="na">order</span><span class="p">:</span> <span class="p">[[</span><span class="dl">'</span><span class="s1">id</span><span class="dl">'</span><span class="p">,</span> <span class="dl">'</span><span class="s1">ASC</span><span class="dl">'</span><span class="p">]]</span> <span class="p">})</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="k">return</span> <span class="k">await</span> <span class="nx">models</span><span class="p">.</span><span class="nx">Product</span><span class="p">.</span><span class="nx">findAll</span><span class="p">({</span> <span class="na">order</span><span class="p">:</span> <span class="p">[[</span><span class="dl">'</span><span class="s1">id</span><span class="dl">'</span><span class="p">,</span> <span class="dl">'</span><span class="s1">DESC</span><span class="dl">'</span><span class="p">]]</span> <span class="p">})</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Here’s the actual <a href="https://github.com/Mahrukhizhar/crate/blob/6dc37fbd5a3b37e35f5c43e27c80a4fe215f9573/code/api/src/modules/product/resolvers.js">commit</a>.</p>
<p>Here’s what the store looks like when the feature is disabled.</p>
<p><img src="/img/feature-flags/nodejs-ff-feature-disabled.png" alt="Feature flag is disabled - showing old sort order" /></p>
<p>When you enable the feature, we can see the new sort order.</p>
<p><img src="/img/feature-flags/nodejs-ff-feature-enabled.png" alt="Feature flag is enabled - showing new sort order" /></p>
<p>The cool thing is that now, we can enable or disable the feature on demand. We can roll it out to more users, kill it if it’s misbehaving, without changing the code or deploying new releases.</p>
<h2 id="summary">Summary</h2>
<p>Feature flags are easy to set up and add a lot of flexibility to your application. Whether you use a feature flag service provider or build your own solution, I’m confident that your team will appreciate decoupling feature releases from code deployments, because it removes a major source of stress.</p>
<h2 id="github-repo">GitHub Repo</h2>
<p>The <a href="https://github.com/Mahrukhizhar/crate">complete source code</a> used in this blog post is available on GitHub. Here’s the commit history with comments describing the changes: <a href="https://github.com/Mahrukhizhar/crate/commits/main">https://github.com/Mahrukhizhar/crate/commits/main</a>.</p>
<h2 id="learn-more">Learn More</h2>
<ul>
<li><a href="https://martinfowler.com/articles/feature-toggles.html">Blog post</a> from Pete Hodgson at MartinFowler.com is a good read.</li>
<li><a href="https://docs.unlaunch.io/docs/getting-started/">Getting Started with Feature Flags</a></li>
<li><a href="https://docs.unlaunch.io/docs/sdks/nodejs-sdk">Unlaunch Node.js SDK</a></li>
</ul>
How to Toggle Features in C# with Feature Flags2021-08-26T00:00:00+00:00https://codeahoy.com/2021/08/26/how-to-toggle-features-in-c-sharp<blockquote>
<p>This blog post was contributed by Tuan Nguyen, Software Developer at Getty Images.</p>
</blockquote>
<p>Using feature flags to release new features to customers is a powerful technique. I have been using feature flags for many years to release features and changes to production safely and with peace of mind. The releases are done without any ceremony, and the features aren’t visible to customers until we turn the feature flag on. Toggling functionality in .NET Core apps is easily achievable using a feature flag management platform.</p>
<p>In this tutorial, I’ll show you how to toggle features on demand in a simple web application using <a href="https://unlaunch.io">Unlaunch</a> to create and manage feature flags. We’ll also see how to show or hide our features with a click of a button, without merging branches, rollbacks or deploys.</p>
<h2 id="feature-flags">Feature Flags</h2>
<p>Feature flags allow developers to control who sees new features, irrespective of code deployment. For example, developers can deploy a new feature to the production environment and keep it hidden from all users (except themselves). This way, they can test on real systems, and when management is ready to release the feature, they can turn on the feature flag to let users in on it. To me, here are some of the benefits of using feature flags:</p>
<ul>
<li>Developers eliminate some of the risk, knowing they can roll things back instantly if things go wrong. On a few occasions, I rolled back features because of a negative impact on a KPI we hadn’t considered, or because one of the underlying systems wasn’t quite ready.</li>
<li>Developers can launch the feature on production behind feature flag and show it to product or marketing team for their feedback.</li>
<li>QA in production. This doesn’t mean that we skipped testing on the ‘Dev’ environment, but rather that launching the feature on production just for the development team gives an extra boost of confidence. It also allows testing with real data, which is not always up to date on the ‘Dev’ environment.</li>
<li>Gradually roll features out starting with a small percentage of users. This is also known as canary releases.</li>
</ul>
<h2 id="basics-of-feature-flags">Basics of Feature Flags</h2>
<p>A feature flag is a simple <em>object</em> that has a Boolean value or variations such as <em>“on”</em> or <em>“off.”</em> Based on the value of the variation, you either do one thing or the other.</p>
<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">string</span> <span class="n">feature</span> <span class="p">=</span> <span class="nf">getFeatureFlag</span><span class="p">(</span><span class="s">"FLAG-ID"</span><span class="p">);</span><span class="c1">// Get feature flag</span>
<span class="k">if</span> <span class="p">(</span><span class="n">feature</span> <span class="p">==</span> <span class="s">"on"</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// show new code </span>
<span class="p">}</span>
<span class="k">else</span>
<span class="p">{</span>
<span class="c1">// show old code</span>
<span class="p">}</span>
</code></pre></div></div>
<p>You can use a config file (<code class="language-plaintext highlighter-rouge">appsettings.json</code>) or a database to keep feature flags. You can also use one of a number of feature flag platforms, which provide a web UI to manage and control feature flags, plus SDKs to make integration easy across the stack.</p>
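<p>If you go the config-file route, a homegrown flag section in <code class="language-plaintext highlighter-rouge">appsettings.json</code> might look like the following sketch (the section and key names are my own, not a standard):</p>

```json
{
  "FeatureFlags": {
    "catalog_reverse": "on"
  }
}
```

<p>You could then read the value through .NET’s configuration system (for example, the <code class="language-plaintext highlighter-rouge">Configuration["FeatureFlags:catalog_reverse"]</code> indexer), at the cost of needing a redeploy or configuration reload to change a flag.</p>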
<h2 id="implementing-feature-flags-in-net-core">Implementing Feature Flags in .NET Core</h2>
<p>I’ll show you how to implement and use feature flags in .NET Core and ASP.NET. I will be using the <a href="https://github.com/unlaunch/dotnet-sdk">Unlaunch SDK for.NET</a> to manage and retrieve feature flags from my code. We’ll use the eShop Web application. Our imaginary <strong>use case</strong> (inspired from real-life) is to change the sort order on our store. We want to show items in reverse order.</p>
<p><img src="/img/feature-flags/normal.jpg" width="75%" alt="eShopWeb - Normal sort order" class="img-thumbnail mx-auto d-block" /></p>
<p>I’ll show you how to implement the new feature (sort items in reverse order) and put it <em>behind</em> a feature flag. When the feature flag is turned “on”, the items will be sorted in reverse order. If the feature flag is turned “off”, the sort order will not change.</p>
<h2 id="before-your-get-started">Before You Get Started</h2>
<h3 id="register-an-unlaunch-account">Register an Unlaunch Account</h3>
<p>Go to <a href="https://unlaunch.io">Unlaunch.io</a> and create a free account. As part of sign up, create a new project. (The name of the project should match that of your application. I named my project ‘eShopWeb’.) Next, I created a new feature flag called ‘catalog_reverse’. Leave everything to its default value. Make sure you enable the flag by clicking on the “Enable Flag” button in the top right corner. To make the feature flag return “on” variation (to signal that we should show the feature to all users), choose “on” variation under “Default Rule”.</p>
<p><img src="/img/feature-flags/csharp-feature-flag-turned-on.png" alt="C# feature flag returning on variation for all users" /></p>
<h3 id="access-the-feature-flag-from-net-core-application">Access the Feature Flag from .NET Core Application</h3>
<h4 id="install-the-unlaunch-sdk">Install the Unlaunch SDK</h4>
<p>To use the feature flag ‘catalog_reverse’ in our application, we need to integrate the <a href="https://github.com/unlaunch/dotnet-sdk">Unlaunch .NET SDK</a>.</p>
<p>Install the <a href="https://www.nuget.org/packages/unlaunch/">Nuget package</a> or run:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Install-Package unlaunch
</code></pre></div></div>
<h4 id="initialize-unlaunch-client-as-a-singleton">Initialize Unlaunch Client as a Singleton</h4>
<p>In <a href="https://github.com/tnguyenquy/eShopOnWeb/commit/116ee293f79b7bac52fd7957dfa2feb033384c2b#diff-7428d2e4bcb9fdecb7a556159a82f21be0ed7fd99ae85a4402460609b67677fdR166"><code class="language-plaintext highlighter-rouge">src/Web/Startup.cs</code></a>, initialize the Unlaunch client. The client will download the feature flag right away and will poll for changes every 60 seconds. You can <a href="https://docs.unlaunch.io/docs/sdks/dotnet-sdk#configuration">customize</a> the polling interval, but in our case, 60 seconds works great. A number of feature flag platforms like LaunchDarkly use a streaming architecture, but I prefer polling to avoid any extra overhead; in most cases, it is okay to wait up to a minute for a feature to be disabled after the flag is turned off in the web UI.</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">services</span><span class="p">.</span><span class="n">AddSingleton</span><span class="p"><</span><span class="n">IUnlaunchClient</span><span class="p">>(</span><span class="n">UnlaunchClient</span><span class="p">.</span><span class="nf">Create</span><span class="p">(</span><span class="s">"YOUR_SDK_KEY"</span><span class="p">));</span>
<span class="n">services</span><span class="p">.</span><span class="n">AddTransient</span><span class="p"><</span><span class="n">UnlaunchService</span><span class="p">>();</span>
</code></pre></div></div>
<p>To find your SDK key, go to the <a href="https://app.unlaunch.io">Unlaunch web UI</a> and click “Settings” in the sidebar. Copy the <strong>Server Key</strong> for the environment you’re working with (Production in my case.) Additional information: Unlaunch provides 3 different types of SDK keys for security purposes. For example, the “Browser Key” is for client-side SDKs, like the React SDK, that run in the browser. To access feature flags with this SDK key, you must update the flag settings to make the flag available on the client side.</p>
<p><img src="/img/feature-flags/sdk-keys-find.png" alt="Find Unlaunch SDK Keys" /></p>
<h4 id="evaluate-feature-flag-and-get-on-or-off-variation">Evaluate feature flag and get “on” or “off” variation</h4>
<p>I created a new service called <code class="language-plaintext highlighter-rouge">UnlaunchService</code> within the application to encapsulate the feature flag logic, so it can be injected wherever it is needed.</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">namespace</span> <span class="nn">Microsoft.eShopWeb.Web.Services</span>
<span class="p">{</span>
    <span class="k">public</span> <span class="k">class</span> <span class="nc">UnlaunchService</span>
    <span class="p">{</span>
        <span class="k">private</span> <span class="k">const</span> <span class="kt">string</span> <span class="n">CatalogReverseFlag</span> <span class="p">=</span> <span class="s">"catalog_reverse"</span><span class="p">;</span>
        <span class="k">private</span> <span class="k">readonly</span> <span class="n">IUnlaunchClient</span> <span class="n">_client</span><span class="p">;</span>

        <span class="k">public</span> <span class="nf">UnlaunchService</span><span class="p">(</span><span class="n">IUnlaunchClient</span> <span class="n">client</span><span class="p">)</span>
        <span class="p">{</span>
            <span class="n">_client</span> <span class="p">=</span> <span class="n">client</span><span class="p">;</span>
        <span class="p">}</span>

        <span class="k">public</span> <span class="kt">bool</span> <span class="nf">IsCatalogReverseFlagEnabled</span><span class="p">(</span><span class="kt">string</span> <span class="n">userIdentity</span><span class="p">)</span>
        <span class="p">{</span>
            <span class="kt">var</span> <span class="n">variation</span> <span class="p">=</span> <span class="n">_client</span><span class="p">.</span><span class="nf">GetVariation</span><span class="p">(</span><span class="n">CatalogReverseFlag</span><span class="p">,</span> <span class="n">userIdentity</span><span class="p">);</span>
            <span class="k">return</span> <span class="n">variation</span> <span class="p">==</span> <span class="s">"on"</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>I created a wrapper function (<code class="language-plaintext highlighter-rouge">IsCatalogReverseFlagEnabled</code>) that returns whether the feature flag is enabled or not. It takes a string argument called <code class="language-plaintext highlighter-rouge">userIdentity</code>. This is important when doing gradual or phased rollouts, as the Unlaunch SDK uses a hash of <code class="language-plaintext highlighter-rouge">userIdentity</code> to determine whether to show the feature to a particular user. For this example, the argument is not used because we are turning the flag on or off for <strong>all</strong> users. But we’ll pass it anyway because we’ll be using it later.</p>
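<p>The role of hashing here is easy to sketch. The snippet below is not the actual Unlaunch SDK algorithm, just an illustration (in JavaScript) of how a deterministic hash of the user identity can gate a percentage rollout:</p>

```javascript
// Deterministic bucketing: the same userIdentity always lands in the
// same bucket, so a user's experience stays stable across requests.
// (Illustrative only; not the actual Unlaunch SDK implementation.)
function bucket(userIdentity) {
  let hash = 0;
  for (const ch of userIdentity) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple 32-bit rolling hash
  }
  return hash % 100; // a bucket in [0, 99]
}

// Show the feature to roughly `percent`% of users.
function isEnabled(userIdentity, percent) {
  return bucket(userIdentity) < percent;
}
```

At 100% every user gets the feature, at 0% nobody does, and ramping from 1% to 5% to 60% only ever adds users, never flip-flops them.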
<p>Then in my <code class="language-plaintext highlighter-rouge">Index.cshtml</code>, I check if the feature flag is enabled or not and if enabled, I sort the items in the reverse order.</p>
<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><div</span> <span class="na">class=</span><span class="s">"esh-catalog-items row"</span><span class="nt">></span>
    @{
        if (UnlaunchService.IsCatalogReverseFlagEnabled(HttpContext.Connection.RemoteIpAddress.ToString()))
        {
            Model.CatalogModel.CatalogItems.Reverse();
        }
        foreach (var catalogItem in Model.CatalogModel.CatalogItems)
        {
            <span class="nt"><div</span> <span class="na">class=</span><span class="s">"esh-catalog-item col-md-4"</span><span class="nt">></span>
                <span class="nt"><partial</span> <span class="na">name=</span><span class="s">"_product"</span> <span class="na">for=</span><span class="s">"@catalogItem"</span> <span class="nt">/></span>
            <span class="nt"></div></span>
        }
    }
<span class="nt"></div></span>
</code></pre></div></div>
<p>As we can see, items in the shop are <strong>reverse sorted</strong> when the feature flag is turned “on”.</p>
<p><img src="/img/feature-flags/reverse.jpg" width="75%" alt="eShopWeb - Reverse sort when feature flag is on" class="img-thumbnail mx-auto d-block" /></p>
<h2 id="targeting-user-segments-with-feature-flags">Targeting User Segments with Feature Flags</h2>
<h3 id="targeting-specific-users-by-id">Targeting Specific Users By ID</h3>
<p>Suppose you want to turn the feature on only for internal users so they can see how things look. For example, if I wanted to keep the feature off for everyone and only enable it for a colleague who’s working from home, I can target by IP address. (Alternatively, you can target users by other criteria, e.g. all users whose email address ends with <code class="language-plaintext highlighter-rouge">xyz@company.com</code>. But if you recall, our shop is visible to logged-out users, so we’re passing the IP address as the identity.)</p>
<p><img src="/img/feature-flags/csharp-ff-targeting-ip.png" alt="feature flag IP address targeting in C#" /></p>
<h3 id="targeting-by-geo-location">Targeting by Geo-location</h3>
<p>Oftentimes, we roll out a feature starting with countries where things going south would have limited impact on revenue or other KPIs. For example, we can target everyone from Germany (DEU) and Austria (AUT). This lets the team learn the behavior and impact of a new feature before opening it up to a broader audience. To do this, <a href="https://docs.unlaunch.io/docs/attributes/attributes">create a new Attribute</a> in Unlaunch of type “Set” and name it “country”.</p>
<p><img src="/img/feature-flags/csharp-ff-targeting-by-geo.png" alt="feature flag targeting by country in C#" /></p>
<p>To make this work, our code has to pass an additional attribute containing the user’s country so the SDK can decide which variation to show. In the Unlaunch service class, I updated <code class="language-plaintext highlighter-rouge">IsEnabled</code> to resolve the user’s country by IP and pass it to the SDK.</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="kt">bool</span> <span class="nf">IsEnabled</span><span class="p">(</span><span class="kt">string</span> <span class="n">userIdentity</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">var</span> <span class="n">variation</span> <span class="p">=</span> <span class="n">_client</span><span class="p">.</span><span class="nf">GetVariation</span><span class="p">(</span>
        <span class="n">FlagKey</span><span class="p">,</span>
        <span class="n">userIdentity</span><span class="p">,</span>
        <span class="n">UnlaunchAttribute</span><span class="p">.</span><span class="nf">NewSet</span><span class="p">(</span><span class="s">"country"</span><span class="p">,</span> <span class="nf">GetLetterCountryCodeByIP</span><span class="p">(</span><span class="n">userIdentity</span><span class="p">))</span>
    <span class="p">);</span>
    <span class="k">return</span> <span class="n">variation</span> <span class="p">==</span> <span class="s">"on"</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
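<p><code class="language-plaintext highlighter-rouge">GetLetterCountryCodeByIP</code> is a helper in the sample app; in practice it would be backed by a GeoIP database (e.g. MaxMind) or a country header from a CDN. A toy stand-in might look like this (JavaScript; the IP ranges below are made up purely for illustration):</p>

```javascript
// Toy IP-to-country lookup. Real applications would consult a GeoIP
// database or an upstream header; these ranges are invented.
const ranges = [
  { start: '10.1.0.0', end: '10.1.255.255', country: 'DEU' },
  { start: '10.2.0.0', end: '10.2.255.255', country: 'AUT' },
];

// Convert a dotted-quad IPv4 address to a comparable integer.
function ipToNumber(ip) {
  return ip.split('.').reduce((n, octet) => n * 256 + Number(octet), 0);
}

function countryCodeByIp(ip) {
  const n = ipToNumber(ip);
  const hit = ranges.find(
    (r) => n >= ipToNumber(r.start) && n <= ipToNumber(r.end)
  );
  return hit ? hit.country : 'UNKNOWN';
}
```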
<h3 id="targeting-by-email">Targeting by Email</h3>
<p>Suppose we want to target all logged-in users in your company. We can define a targeting rule to show the feature to everyone whose email ends with <code class="language-plaintext highlighter-rouge">@yourcompanydomain.com</code>.</p>
<p><img src="/img/feature-flags/csharp-ff-targeting-by-email.png" alt="feature flag targeting by email in C#" /></p>
<p>You’d have to start passing the email address to the SDK so it can make the decision.</p>
<div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="kt">bool</span> <span class="nf">IsEnabled</span><span class="p">(</span><span class="kt">string</span> <span class="n">userIdentity</span><span class="p">,</span> <span class="kt">string</span> <span class="n">email</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">var</span> <span class="n">variation</span> <span class="p">=</span> <span class="n">_client</span><span class="p">.</span><span class="nf">GetVariation</span><span class="p">(</span>
        <span class="n">FlagKey</span><span class="p">,</span>
        <span class="n">userIdentity</span><span class="p">,</span>
        <span class="n">UnlaunchAttribute</span><span class="p">.</span><span class="nf">NewString</span><span class="p">(</span><span class="s">"email"</span><span class="p">,</span> <span class="n">email</span><span class="p">)</span>
    <span class="p">);</span>
    <span class="k">return</span> <span class="n">variation</span> <span class="p">==</span> <span class="s">"on"</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>These are just a few examples of how I’ve been using feature flags to target segments. The possibilities are endless and, in my opinion, this is where feature flags truly shine. You can release a feature to production and turn it on only for specific users (just your product manager or QA engineer), or everyone in the company, or everyone in certain countries, and so on. Try it out and have fun with it.</p>
<h2 id="summary">Summary</h2>
<p>Feature flags are easy to set up and add a lot of flexibility to your application. Whether you use a feature flag service provider or build your own solution, I’m confident that your team will appreciate decoupling feature releases from code deployments, because it removes a major source of stress.</p>
<h2 id="github-repo">GitHub Repo</h2>
<p>The <a href="https://github.com/tnguyenquy/eShopOnWeb">complete source code</a> used in this blog post is available on GitHub. Here’s the <a href="https://github.com/tnguyenquy/eShopOnWeb/commit/116ee293f79b7bac52fd7957dfa2feb033384c2b">commit</a> with comments describing the changes.</p>
<h2 id="learn-more">Learn More</h2>
<ul>
<li><a href="https://martinfowler.com/articles/feature-toggles.html">Blog post</a> from Pete Hodgson at MartinFowler.com is a good read.</li>
<li><a href="https://docs.unlaunch.io/docs/getting-started/">Getting Started with Feature Flags</a></li>
<li><a href="https://docs.unlaunch.io/docs/sdks/dotnet-sdk">Unlaunch .NET SDK</a></li>
</ul>
<h1 id="the-complete-guide-to-feature-flags">The Complete Guide to Feature Flags</h1>
<p>The term feature flags refers to a set of <em>techniques</em> that allow software developers and teams to change the behavior of their system in production without modifying or even deploying code.</p>
<p>Because of their ability to modify system behavior on the fly, feature flags are very <em>powerful</em> and <em>versatile</em>. They facilitate many use cases that boost developer productivity and make the user experience better and faster: gradually rolling out new features to users vs big-bang releases, testing in production, canary launches, experimentation and many more.</p>
<p>We’ll explore feature flag use cases later. But first, let’s examine feature flags in depth and see how they work.</p>
<h2 id="what-is-a-feature-flag">What is a feature flag?</h2>
<p>At the core of this amazing concept, lies a dead simple and basic foundation that uses conditional code (<code class="language-plaintext highlighter-rouge">if</code> statement) to determine whether to perform an action or not.</p>
<p>This is best explained with an example. I’ll use a real one that I worked on not too long ago. We wanted to allow our users to sign-in to our site using their Google account.</p>
<p>To do this, you first create a developer account with Google and set up the OAuth credentials. The application must be approved by Google before you can use it on your site. Until then, it can only be used in <em>test</em> mode.</p>
<p>Traditionally, the code for Google sign-in will be kept in a separate Git branch. When the application is approved by Google, the branch can be merged into <code class="language-plaintext highlighter-rouge">develop</code> or <code class="language-plaintext highlighter-rouge">main</code> (aka <code class="language-plaintext highlighter-rouge">master</code>) so it can be released to users.</p>
<p>The challenge was that the verification process may take several weeks. I tested everything locally to make sure it was all working. But I couldn’t release it yet, even for internal users. If I merged into <code class="language-plaintext highlighter-rouge">develop</code> to deploy the feature on the development environment, it would also release it to production (since <code class="language-plaintext highlighter-rouge">release</code> branches were automatically cut off of <code class="language-plaintext highlighter-rouge">develop</code>.)</p>
<p>To summarize the issue: I wanted to release an <em>unfinished</em> feature to the internal team for feedback and <em>dogfooding</em>. I had no way to show it to the team without merging it to the <code class="language-plaintext highlighter-rouge">develop</code> branch.</p>
<p>I’d ask you to pause here and think how you’d solve this problem? Scroll up and read again if you just skimmed through.</p>
<p>The first solution that may come to mind is quite simple. Introduce a <strong>new variable</strong>: a config parameter, e.g. <code class="language-plaintext highlighter-rouge">showSignInButton</code>, or an environment variable, e.g. <code class="language-plaintext highlighter-rouge">IS_PRODUCTION</code>. Use this in a <strong>conditional</strong> or <code class="language-plaintext highlighter-rouge">if-else</code> statement in the code to disable the feature in production while keeping it visible on the development environment.</p>
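<p>That first solution can be sketched in a couple of lines. This is an illustrative example only; the variable name <code class="language-plaintext highlighter-rouge">SHOW_SIGNIN_BUTTON</code> is made up for this sketch (shown in JavaScript):</p>

```javascript
// The most basic feature flag: a boolean read from an environment
// variable. SHOW_SIGNIN_BUTTON is a hypothetical name for this sketch.
function showSignInButton(env) {
  return env.SHOW_SIGNIN_BUTTON === 'true';
}

// Development: the variable is set, so the feature is visible.
console.log(showSignInButton({ SHOW_SIGNIN_BUTTON: 'true' })); // true
// Production: the variable is absent (or 'false'), so it is hidden.
console.log(showSignInButton({})); // false
```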
<p><img src="/assets/images/featureflags/feature-flag-variations.png" alt="Feature Flag variations" /></p>
<p>And voila! You have arrived at the (most basic) <strong>definition</strong> of feature flag.</p>
<p>To use feature flags in our code, there are three core things that we need:</p>
<ol>
<li>Find the <strong>seam</strong> where you could hide or disable the feature. In the example I shared above, the <em>seam</em> was the login form container in React. I used a feature flag to decide whether to render the sign-in button component or not.</li>
<li>Determine the feature flag variation. This is usually done by calling a function called <code class="language-plaintext highlighter-rouge">evaluate()</code>, which we’ll explore later. For now, all we need to know is that this function returns “on/off” or “true/false” using some logic.</li>
<li>Use the variation as a condition in the <code class="language-plaintext highlighter-rouge">if-then-else</code> statement to determine which block of code to execute e.g. the <code class="language-plaintext highlighter-rouge">if</code>, or the <code class="language-plaintext highlighter-rouge">else</code> block. In our example, “on” or “true” would call the sign in button component.</li>
</ol>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">let</span> <span class="nx">variation</span> <span class="o">=</span> <span class="nx">evaluate</span><span class="p">(</span><span class="dl">"</span><span class="s2">google-sign-in-btn</span><span class="dl">"</span><span class="p">)</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">variation</span> <span class="o">===</span> <span class="dl">"</span><span class="s2">on</span><span class="dl">"</span><span class="p">)</span> <span class="p">{</span>
  <span class="c1">// code to render Google sign-in component</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
  <span class="c1">// don’t show it</span>
<span class="p">}</span>
</code></pre></div></div>
<p>While this is how feature flags generally work, a complete feature flag platform like <a href="https://www.unlaunch.io/product.html">Unlaunch</a> is much more comprehensive, as we’ll see later.</p>
<p>Let’s take a deep dive into the internals of feature flags and how it all works, including the evaluation logic.</p>
<h2 id="how-feature-flags-work">How feature flags work?</h2>
<h3 id="anatomy-of-a-feature-flag">Anatomy of a feature flag</h3>
<p>A feature flag itself is nothing but an object or a container that contains the following key properties:</p>
<ul>
<li><strong>Name</strong> or unique identifier: The name of the feature flag must be unique among all other flags in its scope.</li>
<li><strong>Variations</strong>: A feature flag can have 2 or more variations (multivariate.) Variations are simply strings, e.g. “on” or “off”. They are meant to be used by developers in “if” statements to select the code path. In some feature flagging systems, they could also be “true” or “false”.</li>
<li><strong>(Targeting) Rules</strong>: These take in context such as the user attributes from HTTP request to determine which variation to return. For example, you can set targeting rules to enable a feature by returning “on” variation for new users only.</li>
</ul>
<p><img src="/assets/images/featureflags/feature-flag-internals.png" alt="Internals of feature flag" /></p>
<p>There can be many more properties of a feature flag such as whether it is enabled or disabled, dynamic configuration etc.</p>
<p>To recap: A feature flag is an object that contains a bunch of properties such as variations, targeting rules etc.</p>
<p>You can think of a feature flag as a JSON object describing its properties and state.</p>
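<p>For example, a flag with the properties above could be represented roughly like this. The exact schema varies by platform; this shape is purely illustrative:</p>

```javascript
// A hypothetical feature flag object: name, variations, targeting rules.
const flag = {
  name: 'google-sign-in-btn',
  enabled: true,
  variations: ['on', 'off'],
  defaultVariation: 'off',
  rules: [
    // Serve "on" to users whose email ends with the company domain.
    { attribute: 'email', operator: 'endsWith', value: '@company.com', serve: 'on' },
  ],
};

console.log(flag.variations.length); // 2
```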
<p>The second central component of a feature flagging system is called the evaluator.</p>
<h3 id="evaluator">Evaluator</h3>
<p>The evaluator is a method that takes two inputs:</p>
<ul>
<li>the <strong>feature flag</strong> object (which, as we learned in the last section, contains all the properties), and</li>
<li><strong>context</strong> such as the user id and attributes from the HTTP request</li>
</ul>
<p>The output of the evaluator is always the variation that developers should use in their code to determine the code path. As we learned, variations are strings such as “on” or “off” that are used as conditions in an <code class="language-plaintext highlighter-rouge">if</code> statement.</p>
<p>The evaluator uses these two arguments to choose which variation to return. It knows how to evaluate the targeting rules.</p>
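<p>A bare-bones evaluator handling a single rule type might look like this. This is a sketch, not any particular SDK’s implementation; the <code class="language-plaintext highlighter-rouge">endsWith</code> operator and flag shape are invented for illustration:</p>

```javascript
// Evaluate a flag against a context and return a variation string.
// Supports one hypothetical rule operator, "endsWith".
function evaluate(flag, context) {
  if (!flag.enabled) return flag.defaultVariation;
  for (const rule of flag.rules) {
    const value = context[rule.attribute];
    if (rule.operator === 'endsWith' &&
        typeof value === 'string' &&
        value.endsWith(rule.value)) {
      return rule.serve; // first matching rule wins
    }
  }
  return flag.defaultVariation; // no rule matched
}

const flag = {
  enabled: true,
  defaultVariation: 'off',
  rules: [{ attribute: 'email', operator: 'endsWith', value: '@company.com', serve: 'on' }],
};

console.log(evaluate(flag, { email: 'dev@company.com' }));   // on
console.log(evaluate(flag, { email: 'someone@gmail.com' })); // off
```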
<p><img src="/assets/images/featureflags/feature-flag-management-mechanics.png" alt="Feature flag evaluation and mechanics" /></p>
<p>Side note: If you’re a Java developer, read this <a href="https://codeahoy.com/2020/11/22/feature-flags-with-java/">post</a> on how to integrate and use feature flags in Spring Boot applications.</p>
<h2 id="history-of-feature-flags">History of Feature Flags</h2>
<p>Facebook engineering and research have built a number of amazing products. Apache Cassandra, React, GraphQL are great examples.</p>
<p>While they didn’t invent this technique, Facebook certainly pioneered and made heavy use of feature flags internally.</p>
<p>Facebook is a massive system of interconnected backend services, databases, frontend, interfaces and more. As they grew big, they faced a challenge. How to release new features, changes and updates without breaking things.</p>
<p>In other words, how can the massive social networking site enable thousands of engineers to get their code out quickly to users in safe, small and incremental steps.</p>
<p>While a lot has gone into powering continuous delivery at Facebook’s scale, pushing out an endless stream of changes every single hour was made possible by an internal tool called Gatekeeper.</p>
<p>What is Gatekeeper? You might have already guessed it but it’s Facebook’s internal system for managing feature flags. Jack Lindamood, former Facebook engineer, <a href="https://www.quora.com/How-does-Facebooks-Gatekeeper-service-work" rel="nofollow">described</a> it as:</p>
<div class="language-php highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="p">(</span><span class="nf">gatekeeper_allowed</span><span class="p">(</span><span class="s1">'my_feature_name'</span><span class="p">,</span> <span class="nv">$viewing_user_or_application</span><span class="p">))</span> <span class="p">{</span>
  <span class="nf">run_this_tested_code</span><span class="p">();</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
  <span class="nf">run_this_old_code</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
<p>From Facebook engineering <a href="https://engineering.fb.com/2017/08/31/web/rapid-release-at-massive-scale" rel="nofollow">blog:</a></p>
<blockquote>
<p>If we do find a problem, we can simply switch the gatekeeper off rather than revert back to a previous version or fix forward.</p>
<p>This quasi-continuous release cycle comes with several advantages:</p>
<p><strong>It eliminates the need for hotfixes.</strong>…
<strong>It allows better support for a global engineering team.</strong>… all engineers everywhere in the world can develop and deliver their code when it makes sense for them.
<strong>It makes the user experience better, faster.</strong>… when it takes days or weeks to see how code will behave, engineers may have already moved on to something new. With continuous delivery, engineers don’t have to wait a week or longer to get feedback about a change they made. They can learn more quickly what doesn’t work, and deliver small enhancements as soon as they are ready instead of waiting for the next big release. …</p>
</blockquote>
<p>While Facebook adopted many practices to power continuous delivery at its massive scale, feature flags were <em>fundamental</em> to their approach.</p>
<p>Today, feature flags are widely used at large and medium size companies.</p>
<p>But are they useful only to super large organizations like Facebook, or can small companies and startups benefit from them as well? We’ll discuss this a little later in this post, but the short answer is <em>yes</em>.</p>
<h4 id="whats-in-a-name-that-which-we-call-a-rose">What’s in a name? That which we call a rose…</h4>
<p>Feature flags are known by several other names in the development community: <strong>feature toggles</strong>, <strong>feature flippers</strong>, and perhaps a few more. For the remainder of this post, we’ll stick to the term feature flag as it’s the most popular.</p>
<h2 id="feature-flag-use-cases">Feature Flag Use Cases</h2>
<p>Earlier I mentioned that feature flags are very versatile and can be used to achieve a variety of tasks. They are useful not just for engineering teams, but equally useful to QA, Operations and Product teams.</p>
<p>Let’s recap the key point that we have learned about feature flags so far:</p>
<blockquote>
<p>Feature flags allow developers to modify the behavior of their code at runtime. In other words, they give the ability to ship multiple code paths and choose between them at runtime.</p>
</blockquote>
<p>That’s incredibly powerful and can be used in a variety of contexts to achieve many different goals.</p>
<h3 id="canary-releases-and-gradual-roll-outs-rapid-releases-at-scale">Canary Releases and Gradual Roll outs: Rapid Releases at Scale</h3>
<p>This is perhaps the most common use case of feature flags that I have encountered.</p>
<p>Traditionally, product features were launched as all-or-nothing. Also known as big-bang releases, this meant making new features available to all users at some cut-off point or the release date. These types of releases require a lot of testing prior to launch and if something goes wrong later, the developers have to patch or get hotfixes out to resolve the bug.</p>
<p>In other words, when the feature is ready, you let it rip.</p>
<p>In canary releases, instead of launching the feature to all your users at the same time, it is released to a small number of users initially. This allows reviewing results such as user sentiment and more concrete metrics like system load, performance etc. Once satisfied, the feature can be launched to a wider group of users.</p>
<p>Canary releases limit the <strong>blast radius</strong>. If issues are discovered, only a small number of users are impacted.</p>
<p>Canary releases have been in use for a long time. Traditionally, they are achieved by having a separate cluster of servers or a replica of the production environment. The load balancer sends a small number of users to the cluster where the new version of the application is deployed. For example, 2% of the traffic goes to the new code; the rest goes to the existing production environment.</p>
<p><img src="/assets/images/featureflags/blue-green-deployments.png" alt="Canary releases using Blue Green Deployments" /></p>
<p>You can also achieve canary launches using feature flags. In fact, feature flags bring several improvements over traditional canary releases.</p>
<ul>
<li>Canary clusters are typically controlled by DevOps or Operations teams. It requires jumping through extra hoops and coordination if a developer needs to roll back something.</li>
<li>If you do need to roll back a feature, all features in the canary cluster will be rolled back. Including good features that other teams may have deployed.</li>
<li>While canary releases are better than all-or-nothing releases, there are still discrete jumps: 0% to 2% to 100% that may not be ideal for all use cases.</li>
<li>Not all teams have the budget or resources to manage a separate canary cluster.</li>
</ul>
<p>Feature flags put canary releases on steroids. If I compare traditional canary clusters to monoliths, feature flag based ‘canary’ releases are like microservices. They are small, decentralized and put control back in the hands of teams and developers.</p>
<p>Feature flags put control right back into the hands of developers. Developers can launch their features when they want, and disable just the feature that is misbehaving without impacting other features.</p>
<p>And best of all, feature flags do not require building a complex and expensive infrastructure to do canary releases.</p>
<p>They also enable a wonderful use case that I personally love: gradual or percentage-based rollouts. This allows teams to slowly ramp up traffic to their feature in any ‘continuous’ increments they want.</p>
<p>For example, when we switched over to a new search backend (ElasticSearch,) we initially sent 1% of the traffic to it. Later we increased it to 5% and monitored for several days. We kept increasing it gradually. At 60% traffic, we discovered several issues. We instantly disabled the feature with a click of a button, fixed the issue and resumed again.</p>
<p>I have also seen some teams let their PMs release (low risk) features such as a new blog post on their site, awards lists, etc. in coordination with marketing.</p>
<h3 id="alpha-testing-and-dogfooding">Alpha testing and dogfooding</h3>
<p><a href="https://blog.unlaunch.io/2021-02-15-dogfodding-at-unlaunch/">Dogfooding</a>, aka eating your own dog food or drinking your own champagne, is the practice of using your own products within your company. It’s a great way to not only test products in practical, real-world situations for finding bugs, but also to first hand experience your product like your users.</p>
<p>While dogfooding should be an <strong>on-going</strong> endeavor, it is especially important in the context of new features or major changes and provides an early opportunity for internal users to use the feature and provide meaningful feedback before opening the floodgates.</p>
<p>Feature flags are great for establishing a culture of dogfooding within the company. Release features to production but only for internal teams and allow them to be the first users.</p>
<h3 id="continuous-delivery">Continuous Delivery</h3>
<p>Continuous delivery is not a new concept. Over time, our field has evolved and created many methodologies, such as agile software development, continuous integration, and continuous delivery, to name a few, with the goal of delivering new changes to users faster, more safely, and at higher quality.</p>
<p>The typical git-flow goes like this:</p>
<ol>
<li>Developers create separate feature branches to work on their features</li>
<li>When the feature is complete, it is merged into the develop branch. This usually results in the feature being deployed to the central development or QA environment.</li>
<li>The code is then merged into the master or release branch and released to production.</li>
</ol>
<p><img src="/assets/images/featureflags/git-flow-feature-flags.png" alt="Git flow" /></p>
<p>While this was a great approach, it has a few challenges:</p>
<ol>
<li>The feature branches can be long running. Even if the feature is complete, the team may have to wait for an external approval or another feature to launch before they can release their code. I have seen cases where feature branches sat in isolation for weeks before being merged into <code class="language-plaintext highlighter-rouge">develop</code> because they depended on the results of an AB test that hadn’t completed.</li>
<li>The feature cannot be released even internally until the feature branch is merged into the <code class="language-plaintext highlighter-rouge">develop</code> branch.</li>
<li>I have always held the belief that the best test environment is the production environment. I’m not saying everyone should test only on production, but it is the most accurate. It’s hard to emulate production traffic and patterns in test environments. Following git flow, because we can’t merge our features until ready, we can’t get them to production, even just for ourselves or the team.</li>
</ol>
<p>Using feature flags, you can merge in-progress feature branches into <code class="language-plaintext highlighter-rouge">develop</code> or even <code class="language-plaintext highlighter-rouge">main</code> (release) branches. They allow unfinished features to be released to production. Not only does this get rid of long-running git branches, it’s also great for <a href="https://blog.unlaunch.io/2021-02-15-dogfodding-at-unlaunch/">dogfooding</a> and showing off features to internal teams for testing or feedback.</p>
<p>For example, we used to regularly push almost-done features to production, enable them for internal users only (by IP or email domain, e.g. ‘@company.com’), and share links around (including with the CEO and relevant stakeholders.) Everyone had access to the production environment and we instantly received feedback.</p>
<p>If you think about it, in git-flow, merging code is tightly coupled with releasing it to your users. When you merge to develop, it is released to everyone within the company. When you merge it to master, it is released to all users.</p>
<p>Feature flags provide a nice way of separating the two concerns: merging code is separated from when you release it. In other words, you can merge unfinished features all day long and only allow users to see it when ready.</p>
<h3 id="ab-testing-and-experimentation">AB Testing and Experimentation</h3>
<p>Feature flags are great for running AB tests. You can define buckets (variations) and assign percentages to variations.</p>
<p>Product and marketing teams have been running AB tests for a long time. They are great for vetting new ideas quickly, and eliminating bad ones.</p>
<p><img src="/assets/images/featureflags/A-B_testing_example.png" alt="AB Testing using feature flags" /></p>
<p>While the situation is improving, I don’t see engineers running a lot of experiments for technical features that they implement. AB tests are usually pitched by product teams.</p>
<p>I don’t blame engineers entirely - AB testing is a complex business, especially for big product-related features, and requires deep analysis (usually by analytics or data science teams, across many KPIs.)</p>
<p>But not everything requires that level of in-depth analysis. Sometimes, all that’s needed to call an experiment is the count of errors across the test buckets and controls.</p>
<p>At my last two jobs, we successfully ran many light-weight, engineering focused experiments. Things like response time impact across variants, error counts, etc. were commonly experimented upon. Feature flags made it easy to run these and quickly analyze the results.</p>
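<p>The bucketing behind such experiments can be sketched in a few lines. This is a hypothetical illustration, not any specific tool’s implementation: hash the user id to a stable bucket, then split buckets between control and treatment by percentage so the same user always sees the same variation.</p>

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class AbBucketing {
    // Map a user id to a stable bucket in [0, 100) using a CRC32 hash.
    // The same user always lands in the same bucket across requests.
    static int bucketFor(String userId) {
        CRC32 crc = new CRC32();
        crc.update(userId.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % 100);
    }

    // Users in buckets [0, treatmentPercent) see the treatment variation.
    static String variationFor(String userId, int treatmentPercent) {
        return bucketFor(userId) < treatmentPercent ? "treatment" : "control";
    }

    public static void main(String[] args) {
        // A 10% experiment: roughly 1 in 10 users gets the new behavior.
        System.out.println(variationFor("user-42@example.com", 10));
    }
}
```

<p>Because the assignment is a pure function of the user id, you can log the variation alongside error counts or response times and compare buckets afterwards.</p>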
<h3 id="kill-switches-and-performance-knobs">Kill switches and performance knobs</h3>
<p>Feature flags can be used to implement kill switches. These allow teams (usually DevOps or Operations) to gracefully degrade non-essential functionality when under load or in the event of an outage. For example, a website may disable upload functionality for some or all users if there’s too much load on the backend systems or the system is under attack.
Kill switches are related to the circuit breaker pattern, which enables or disables a certain code path automatically; a kill switch can be thought of as a manual circuit breaker.</p>
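<p>A minimal kill switch is just a flag consulted on the request path, with the safe degraded behavior as the fallback. The sketch below uses hypothetical names (the flag store and the upload handler are stand-ins, not a specific product’s API):</p>

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class KillSwitches {
    // Flag states, flipped at runtime by an ops dashboard or config poller
    // (hypothetical; a real system would refresh these from a flag service).
    private static final ConcurrentMap<String, Boolean> flags = new ConcurrentHashMap<>();

    static void set(String name, boolean enabled) { flags.put(name, enabled); }

    // Fail safe: a missing or unknown flag is treated as disabled.
    static boolean isEnabled(String name) { return flags.getOrDefault(name, false); }

    static String handleUpload(byte[] payload) {
        if (!isEnabled("uploads-enabled")) {
            // Gracefully degrade instead of overloading backend systems.
            return "503: uploads temporarily disabled";
        }
        return "200: stored " + payload.length + " bytes";
    }

    public static void main(String[] args) {
        set("uploads-enabled", true);
        System.out.println(handleUpload(new byte[16]));  // normal path
        set("uploads-enabled", false);                   // ops flips the switch
        System.out.println(handleUpload(new byte[16]));  // degraded path
    }
}
```

<p>Note that the lookup fails safe: if the flag is missing, the feature stays off, so a bad config can’t accidentally re-enable functionality you meant to shed.</p>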
<p>Performance knobs are similar to kill switches, but instead of completely disabling a feature, they are used for throttling or rate limiting a feature or API for a certain class of users. For example, under extreme load, you could limit access for new users but allow existing users to continue as normal.</p>
<h3 id="migrations">Migrations, especially those that require coordination</h3>
<p>Feature flags can be very useful when performing migrations, especially those that require coordination.</p>
<h3 id="feature-flags-lifespan-and-dynamism">Feature Flags Lifespan and Dynamism</h3>
<p>We explored many use cases of feature flags. In his extensive post on <a href="https://martinfowler.com/articles/feature-toggles.html" rel="nofollow">feature flags</a>, Pete Hodgson created a nice graph of the <em>types</em> of feature flags he described versus their <em>lifespan</em>.</p>
<p>Feature flags fall into two categories with respect to their lifespan: <strong>temporary</strong> vs. long-living or even <strong>permanent</strong>. A feature flag you’re using to roll out a new feature is temporary: when the feature is 100% rolled out, the flag can be deleted. On the other hand, a kill switch that controls whether to show or hide a heavy widget on your website is a permanent flag that will always be there.</p>
<p>The graph below shows lifespan (how long feature flags stay active in code) vs. dynamism. <strong>Dynamism</strong> describes how dynamic a feature flag is. At one end of the spectrum, you have flags like kill switches that don’t depend on any context. At the other end, you have feature flags that depend on per-request parameters like the type of user and user attributes.</p>
<p><img src="/assets/images/featureflags/feature-flags-temporary-vs-permanent.png" alt="Feature flags lifespan. Temporary vs permanent flags" /></p>
<h2 id="how-to-feature-flags">How to implement feature flags?</h2>
<p>There are many ways to implement and use feature flags.</p>
<p>The easiest way is to use hard-coded <code class="language-plaintext highlighter-rouge">if-else</code> statements in your code. This approach is not very flexible and defeats the purpose of feature flags: any change to a flag’s state (enabling, disabling, or a gradual rollout) requires a code deploy.</p>
<p>Taking it one step further, you can make the flag condition configurable, such as storing it in a database where you can change it on the fly.</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">let</span> <span class="nx">enableLatestWidget</span> <span class="o">=</span> <span class="nx">db</span><span class="p">.</span><span class="nx">read</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">enableLatestWidget</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// show widget</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="c1">// don’t show it</span>
<span class="p">}</span>
</code></pre></div></div>
<p>While this is better than the first approach and certainly works, it’s very brittle.</p>
<p>You can’t do things like gradual rollouts, target users based on their IDs or attributes, run experiments, or track KPIs. There’s no way to see all your feature flags laid out on a nice dashboard and collaborate with your team in one place.</p>
<p>At least not without writing a lot of code yourself.</p>
<p>Taking it even a step further brings us to complete <strong>feature flag management tools</strong> like <a href="https://www.unlaunch.io/">Unlaunch</a>.</p>
<p>Comparing managing feature flags in a database vs. a dedicated feature flag tool is like comparing an Excel spreadsheet to Jira (or Trello.) Sure, you can track tasks in Excel, but it’s much more efficient to use Jira.</p>
<p>Arguably the biggest benefit these mature feature flag tools provide is encouraging feature flag and experimentation best practices and establishing a <strong>culture</strong> organization-wide.</p>
<p>These tools have low overhead. For server-side applications, flags are fetched when the application starts (and then refreshed periodically.) All evaluations occur in memory, introducing no additional latency.</p>
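<p>The low-overhead design described above can be sketched roughly like this. It is a toy illustration, not how any particular SDK is built; the <code class="language-plaintext highlighter-rouge">fetch</code> supplier stands in for the network call to the flag service:</p>

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class InMemoryFlagStore {
    private final Map<String, Boolean> flags = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // 'fetch' stands in for the HTTP call that downloads flag definitions.
    public InMemoryFlagStore(Supplier<Map<String, Boolean>> fetch, long refreshSeconds) {
        flags.putAll(fetch.get());  // initial sync at startup
        scheduler.scheduleAtFixedRate(
                () -> flags.putAll(fetch.get()),  // periodic background refresh
                refreshSeconds, refreshSeconds, TimeUnit.SECONDS);
    }

    // Evaluation is an in-memory lookup: no network call, no added latency.
    public boolean isEnabled(String flagName) {
        return flags.getOrDefault(flagName, false);
    }

    public void shutdown() { scheduler.shutdownNow(); }
}
```

<p>Real SDKs layer targeting rules, percentage rollouts, and event tracking on top of the same idea, but the hot path remains an in-memory lookup.</p>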
<div class="alert alert-light" role="alert">
<p class="text-danger lead text-center mt-3">
Unlaunch is a complete feature flag management tool that I have been working on since last year. It is free for solo developers and small teams.</p>
<p class="lead text-center">
<a href="https://app.unlaunch.io/signup" class="text-primary">Sign up for a free account today</a>
</p>
</div>
<p><a href="https://app.unlaunch.io/signup">
<img src="/assets/images/featureflags/feature-flags-dashboard.png" class="img-fluid" alt="Feature Flags Dashboard" />
</a></p>
<h3 id="cleaning-up-after-yourself">Cleaning up after yourself</h3>
<p>This is important enough to warrant its own section.</p>
<p>You must remove feature flags from your code once they are no longer needed. If you launched a new feature using a feature flag, then once it is rolled out at 100% and there’s no risk (or you can’t disable it anymore), remove the flag from your code. Search for the flag’s name and delete both the flag check and the code path that is no longer used.</p>
<h3 id="feature-flags--startups">Feature Flags & Startups</h3>
<p>Feature flags are very common in large to medium sized organizations and have been around for many years.</p>
<p>But I have not seen a whole lot of feature flag use at smaller companies or startups.</p>
<p>Which raises the <em>question</em>: can startups and smaller companies benefit from using feature flags, or is it just needless overhead?</p>
<p>The answer is that startups could and <em>absolutely should use feature flags</em>.</p>
<p>Startups operate at a much faster pace than large organizations, by their nature. Feature flags are an ideal vehicle to safely and reliably keep releasing new features to your users. By adopting feature flags early, startups can:</p>
<ul>
<li>build a culture of safe and fast releases.</li>
<li>use feature flags for Alpha testing or <em>dogfooding</em> their products. This is great for teams who don’t have the budget or resources to maintain a separate QA or development environment to test or stage their features.</li>
</ul>
<p>Last year I was consulting for a small startup that was building chatbots. They didn’t have the budget to build and maintain a separate QA environment to test things out before releasing to their users. So we built a simple feature-flag-based system that allowed developers to release features to production but only allow access to internal teams (by phone number.)</p>
<div class="alert alert-light" role="alert">
<p class="text-danger lead text-center mt-3">
Are you using feature flags in your organization to release new features faster and with confidence? Please comment below to share any tips, ideas or best practices.</p>
</div>
Tutorial on using Feature Flags in Java [Complete Example]2020-11-22T00:00:00+00:00https://codeahoy.com/2020/11/22/feature-flags-with-java<p>Feature flags (or feature toggles) are a powerful technique used by modern software teams to manage the behavior of their code in production. Gone are the days when production releases happened occasionally and were a big event. These days, agile teams deliver changes to production continuously, sometimes releasing changes several times a day without any fanfare. Feature flags let teams release their changes to production and stay in full control over who gets to see the changes and when. Using feature flags, you can:</p>
<ul>
<li>deploy changes to production but keep them hidden from everyone but internal users.</li>
<li>gradually roll out features to build confidence.</li>
<li>if things don’t look good (you discover errors), roll back instantly without any code changes or redeploys.</li>
<li>test your changes in production.</li>
</ul>
<p>I’ve been using feature flags for the last 5 years and find them very useful in shipping changes quickly and confidently to millions of users, from infrastructure changes like DB upgrades to UI changes.</p>
<p>In this post, we’ll see how to use feature flags in Java using Spring Boot. Let’s get started.</p>
<p>But first, head over to Unlaunch and <strong><a href="https://app.unlaunch.io/signup" target="_blank" rel="noopener noreferrer">create a free account</a></strong>. After you have created a new account, come back and resume reading.</p>
<h2 id="challenge">Challenge</h2>
<p>Suppose you’re a software developer working on a backend service: User Profile. There is an API that returns user information from the database, combining it with information retrieved from other services and sources. One day, you were reviewing the performance of the API and noticed it is very slow, taking up to a second to return a response.</p>
<p>After some investigation, you find the root cause. The delay is introduced by the way the service calls other services to get the user’s status, open orders and pending cancelations: the calls are made sequentially, that is, one after the other. You think you can improve the performance by <em>parallelizing</em> the calls.</p>
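<p>To make the idea concrete, here is a rough sketch of the change using <code class="language-plaintext highlighter-rouge">CompletableFuture</code>. The three downstream calls are hypothetical stand-ins; the point is that sequential latencies add up, while concurrent calls cost roughly as much as the slowest one:</p>

```java
import java.util.concurrent.CompletableFuture;

public class UserProfileCalls {
    // Hypothetical stand-ins for the downstream service calls described above.
    static String fetchStatus(String id)       { sleep(100); return "active"; }
    static String fetchOpenOrders(String id)   { sleep(100); return "2 orders"; }
    static String fetchCancelations(String id) { sleep(100); return "none"; }

    // Old path: the three latencies add up (roughly 300 ms here).
    static String sequential(String id) {
        return fetchStatus(id) + ", " + fetchOpenOrders(id) + ", " + fetchCancelations(id);
    }

    // New path: the calls run concurrently, so the total is roughly
    // the slowest single call (about 100 ms here).
    static String parallel(String id) {
        CompletableFuture<String> status  = CompletableFuture.supplyAsync(() -> fetchStatus(id));
        CompletableFuture<String> orders  = CompletableFuture.supplyAsync(() -> fetchOpenOrders(id));
        CompletableFuture<String> cancels = CompletableFuture.supplyAsync(() -> fetchCancelations(id));
        return status.join() + ", " + orders.join() + ", " + cancels.join();
    }

    private static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        String result = parallel("umer@gmail.com");
        long elapsedMs = (System.nanoTime() - t0) / 1_000_000;
        System.out.println(result + " (took " + elapsedMs + " ms)");
    }
}
```

<p>Both paths return the same result; only the latency profile differs, which is exactly the kind of before/after comparison you’d want to run behind a feature flag.</p>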
<p>In traditional software development, you’d create a new feature branch e.g. <code class="language-plaintext highlighter-rouge">feature/JIRA-101-async-calls-in-user-profile</code>. You’d make your code changes and keep committing to this branch. When everything looks good, you’d open a pull request to merge into the development branch. You’d do some more QA and performance testing in the QA environment. Finally, your code would be merged into the main branch and your feature would go to staging and then to production. If you found any issues in production, you’d have to commit a hotfix directly to main to roll back your changes.</p>
<p>Sounds stressful. If you think about it, the change you’ve identified, i.e. making asynchronous calls, is not complicated. But given the way software releases are done in large enterprises, and the risk of breaking something even with a small change and impacting millions of users, there’s a lot of added friction and stress.</p>
<p>Enter feature flags. Let’s see how we’d use feature flags to solve the performance issue and release with peace of mind.</p>
<ol>
<li>Create a new feature flag. Turn it off for everyone except yourself. You can do this by user id, by environment, or by any other criteria.</li>
<li>Create a new method which makes asynchronous calls to other services.</li>
<li>In the main method, check if the feature flag is <strong>on</strong>.
<em>3a:</em> If the feature flag is <em>on</em>: call the new method which makes async calls
<em>3b:</em> If the feature flag is <em>off</em>: Call the old method that you’re trying to replace.</li>
<li>Enable the feature flag in the QA environment. This will allow everyone using the QA environment to see your changes and test them.</li>
<li>Merge into the main branch if everything looks good. Enable the feature on your staging environment. Enable it for all internal teams in production but keep it hidden from your users.</li>
<li>Compare the performance (API response time) of your change to the old API. If everything looks great, gradually roll out the feature to real users.</li>
</ol>
<p>Let’s see how we can do these steps and use feature flags to release our code to production quickly and confidently.</p>
<h2 id="solution">Solution</h2>
<h3 id="1-set-up-feature-flag">1. Set up Feature Flag</h3>
<p>If you haven’t already done so, head over to <a href="https://app.unlaunch.io/signup">https://app.unlaunch.io/signup</a> and create a new account. <a href="https://unlaunch.io">Unlaunch is a free feature flag service</a>. As part of account creation, you’d be asked to create a new <strong>project</strong>. A project is a collection of related feature flags. In this example, we’ll create a new project called “User Profile” as part of registration. By default, each project gets two environments: <em>Production</em> and <em>Test</em>. To keep things simple for this example, we’ll do all work in the Production environment. But separate environments allow you to test changes to feature flags independently.</p>
<p>Make sure that the <em>Production</em> environment is selected using the dropdown on the top left corner of the sidebar.</p>
<p><img src="https://codeahoy.com/img/feature-flags/initial.png" alt="Unlaunch Console" class="center-image" />
Figure 1 - Unlaunch Console</p>
<p>Next up, create a new feature flag. Follow the screenshot below. Here, we create a feature flag with two variations: <em>on</em> and <em>off</em>. In our code, we’ll check this flag. If the <em>on</em> variation is returned, we’ll show the new feature (i.e. call the new method we just implemented.) Otherwise, we’ll hide the feature (i.e. use the old code.)</p>
<p><img src="https://codeahoy.com/img/feature-flags/createflag.png" alt="new feature flag" class="center-image" />
Figure 2 - Create a new flag</p>
<p>Next, we’ll enable the feature flag, that is, serve the “on” variation to only a single user for now. For everyone else, we’ll serve the “off” variation. Please make sure your setup looks similar to the screenshot below. Also, don’t forget to enable the flag by clicking the green “Enable Flag” button.</p>
<p><img src="https://codeahoy.com/img/feature-flags/flagsetup.png" alt="Turn the feature flag on for one user id" class="center-image" />
Figure 3 - Turn the feature flag on for user id <code class="language-plaintext highlighter-rouge">umer@gmail.com</code> only</p>
<p>At this point, the flag setup is complete. Let’s take a look at how we’d set up the SDK to call this flag and show the new feature.</p>
<h3 id="2-call-feature-flag-in-your-application">2. Call Feature Flag in Your Application</h3>
<p>Now, let’s use this feature flag in our Spring Boot service to determine whether to use the new or the old algorithm.</p>
<p>Note: If you are looking for a plain Java example, please see <a href="https://github.com/unlaunch/hello-java">this repo</a>.</p>
<p>First, add the Unlaunch Java SDK dependency in your <code class="language-plaintext highlighter-rouge">pom.xml</code> (or <code class="language-plaintext highlighter-rouge">build.gradle</code>)</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><dependency></span>
<span class="nt"><groupId></span>io.unlaunch.sdk<span class="nt"></groupId></span>
<span class="nt"><artifactId></span>unlaunch-java-sdk<span class="nt"></artifactId></span>
<span class="nt"><version></span>0.0.8<span class="nt"></version></span>
<span class="nt"></dependency></span>
</code></pre></div></div>
<p>Next, you’ll need to provide the SDK key to connect to your Unlaunch project and environment. You can find the SDK key for your environment by clicking <em>Settings</em> in the sidebar. Choose the <strong>Server Key</strong> for the <em>Production</em> environment.</p>
<p><img src="https://codeahoy.com/img/feature-flags/key.png" alt="new feature flag" class="center-image" /></p>
<p>You can paste the SDK key in your <code class="language-plaintext highlighter-rouge">application.yaml</code> or make it available as an environment variable <code class="language-plaintext highlighter-rouge">export UNLAUNCH_SERVER_KEY=<paste server key here></code>.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">unlaunch</span><span class="pi">:</span>
<span class="na">server</span><span class="pi">:</span>
<span class="na">key</span><span class="pi">:</span> <span class="s">PASTE_SERVER_KEY_HERE_FROM_SETTINGS</span>
</code></pre></div></div>
<h4 id="2a-initialize-the-unlaunch-client">2a. Initialize the Unlaunch Client</h4>
<p>Next, we’ll create and initialize the <a href="https://javadoc.io/doc/io.unlaunch.sdk/unlaunch-java-sdk/latest/io/unlaunch/UnlaunchClient.html">UnlaunchClient</a> as a bean. When you initialize the client, it downloads all feature flags into memory. All evaluations are done against an in-memory data store (e.g. a HashMap) and are very fast.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@org</span><span class="o">.</span><span class="na">springframework</span><span class="o">.</span><span class="na">context</span><span class="o">.</span><span class="na">annotation</span><span class="o">.</span><span class="na">Configuration</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">Configuration</span> <span class="o">{</span>
<span class="nd">@Value</span><span class="o">(</span><span class="s">"${unlaunch.server.key}"</span><span class="o">)</span>
<span class="kd">private</span> <span class="nc">String</span> <span class="n">sdkKey</span><span class="o">;</span>
<span class="kd">private</span> <span class="kd">static</span> <span class="kd">final</span> <span class="nc">Logger</span> <span class="n">logger</span> <span class="o">=</span> <span class="nc">LoggerFactory</span><span class="o">.</span><span class="na">getLogger</span><span class="o">(</span><span class="nc">Configuration</span><span class="o">.</span><span class="na">class</span><span class="o">);</span>
<span class="nd">@Bean</span>
<span class="kd">public</span> <span class="nc">UnlaunchClient</span> <span class="nf">unlaunchClient</span><span class="o">()</span> <span class="o">{</span>
<span class="nc">UnlaunchClient</span> <span class="n">client</span> <span class="o">=</span> <span class="nc">UnlaunchClient</span><span class="o">.</span><span class="na">builder</span><span class="o">().</span>
<span class="n">sdkKey</span><span class="o">(</span><span class="n">sdkKey</span><span class="o">).</span>
<span class="n">pollingInterval</span><span class="o">(</span><span class="mi">30</span><span class="o">,</span> <span class="nc">TimeUnit</span><span class="o">.</span><span class="na">SECONDS</span><span class="o">).</span>
<span class="n">eventsFlushInterval</span><span class="o">(</span><span class="mi">30</span><span class="o">,</span> <span class="nc">TimeUnit</span><span class="o">.</span><span class="na">SECONDS</span><span class="o">).</span>
<span class="n">eventsQueueSize</span><span class="o">(</span><span class="mi">500</span><span class="o">).</span>
<span class="n">metricsFlushInterval</span><span class="o">(</span><span class="mi">30</span><span class="o">,</span> <span class="nc">TimeUnit</span><span class="o">.</span><span class="na">SECONDS</span><span class="o">).</span>
<span class="n">metricsQueueSize</span><span class="o">(</span><span class="mi">100</span><span class="o">).</span>
<span class="n">build</span><span class="o">();</span>
<span class="k">try</span> <span class="o">{</span>
<span class="n">client</span><span class="o">.</span><span class="na">awaitUntilReady</span><span class="o">(</span><span class="mi">2</span><span class="o">,</span> <span class="nc">TimeUnit</span><span class="o">.</span><span class="na">SECONDS</span><span class="o">);</span>
<span class="n">logger</span><span class="o">.</span><span class="na">info</span><span class="o">(</span><span class="s">"unlaunch client is ready"</span><span class="o">);</span>
<span class="o">}</span> <span class="k">catch</span> <span class="o">(</span><span class="nc">InterruptedException</span> <span class="o">|</span> <span class="nc">TimeoutException</span> <span class="n">e</span><span class="o">)</span> <span class="o">{</span>
<span class="n">logger</span><span class="o">.</span><span class="na">warn</span><span class="o">(</span><span class="s">"client wasn't ready "</span> <span class="o">+</span> <span class="n">e</span><span class="o">.</span><span class="na">getMessage</span><span class="o">());</span>
<span class="o">}</span>
<span class="k">return</span> <span class="n">client</span><span class="o">;</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<h4 id="2b-evaluate-the-flag-to-choose-new-or-old-algorithm">2b. Evaluate the Flag to Choose New or Old Algorithm</h4>
<p>Then in the <code class="language-plaintext highlighter-rouge">UserService</code> class, let’s decide whether to use the new algorithm or not.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@Service</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">UserService</span> <span class="o">{</span>
<span class="nd">@Autowired</span>
<span class="kd">private</span> <span class="nc">UnlaunchClient</span> <span class="n">unlaunchClient</span><span class="o">;</span>
<span class="kd">public</span> <span class="nc">User</span> <span class="nf">getUserById</span><span class="o">(</span><span class="nc">String</span> <span class="n">id</span><span class="o">)</span> <span class="o">{</span>
<span class="c1">// Evaluate the feature flag</span>
<span class="nc">String</span> <span class="n">variation</span><span class="o">=</span> <span class="n">unlaunchClient</span><span class="o">.</span><span class="na">getVariation</span><span class="o">(</span><span class="s">"implement-async-calls"</span><span class="o">,</span> <span class="n">id</span><span class="o">);</span>
<span class="k">if</span> <span class="o">(</span><span class="n">variation</span><span class="o">.</span><span class="na">equals</span><span class="o">(</span><span class="s">"on"</span><span class="o">))</span> <span class="o">{</span>
<span class="c1">// Call the New algorithm if the flag returns: on</span>
<span class="k">return</span> <span class="nf">newAlgorithm</span><span class="o">(</span><span class="n">id</span><span class="o">);</span>
<span class="o">}</span> <span class="k">else</span> <span class="o">{</span>
<span class="c1">// Call the Old algorithm if flag returns: off</span>
<span class="k">return</span> <span class="nf">oldAlgorithm</span><span class="o">(</span><span class="n">id</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>One note: the <code class="language-plaintext highlighter-rouge">getVariation(...)</code> method will <em>never</em> throw an exception or return <code class="language-plaintext highlighter-rouge">null</code>. If there is an error, such as the flag not being found or the initial sync failing, it returns a special String value: <strong>control</strong>. So you don’t have to worry about adding any additional error handling or checks around this method and can safely use it everywhere.</p>
<h3 id="3-try-it-out">3. Try it out</h3>
<p>You can find the complete source code for this project on <a href="https://github.com/unlaunch/springboot-example">GitHub</a>. You can clone and run it locally using <code class="language-plaintext highlighter-rouge">mvn spring-boot:run</code> or (<code class="language-plaintext highlighter-rouge">gradlew bootRun</code>).</p>
<p>To test the feature flag, call the API with any user id, e.g. <code class="language-plaintext highlighter-rouge">kelly@gmail.com</code>. Because we disabled the flag for all users except <code class="language-plaintext highlighter-rouge">umer@gmail.com</code>, the old algorithm will be used: <code class="language-plaintext highlighter-rouge">http://localhost:8085/api/v1/users/kelly@gmail.com</code></p>
<p><img src="https://codeahoy.com/img/feature-flags/output-kelly.png" alt="new feature flag" class="center-image" /></p>
<p>Now let’s call the same API with <code class="language-plaintext highlighter-rouge">umer@gmail.com</code> which should return the <em>on</em> variation and hence the new algorithm. (See Figure 3.) <code class="language-plaintext highlighter-rouge">http://localhost:8085/api/v1/users/umer@codeahoy.com</code></p>
<p><img src="https://codeahoy.com/img/feature-flags/output-umer.png" alt="new feature flag" class="center-image" /></p>
<p>That’s all. If you enjoyed this post, please share it.</p>
<ul>
<li><a href="https://github.com/unlaunch/springboot-example">Complete Source Code on GitHub - Spring Boot</a></li>
<li><a href="https://github.com/unlaunch/hello-java">Source Code on GitHub - Plain Java</a></li>
</ul>
COVID-19 - Remote Work Policy by Companies2020-03-15T00:00:00+00:00https://codeahoy.com/2020/03/15/cvoid-19-wfh-companies
<p>In response to the Coronavirus outbreak (COVID-19), many companies, including Amazon, Apple, Facebook, Google, Salesforce, and more, have been encouraging those workers who can do so to work from home. Some companies have even made it mandatory for employees to work remotely, shutting down their offices. A surprising number of companies have no plans to follow suit.</p>
<p>We have created these trackers to track remote work policies by employer and to call out those employers who aren’t allowing their employees to work remotely despite being capable of performing their job duties remotely.</p>
<h3 id="list-of-remote-work-policy-by-country">List of remote work policy by country</h3>
<h4 id="1-usa"><a href="/general/cvoid-19-wfh-companies-list-usa.html">1. USA</a></h4>
<h4 id="2-canada"><a href="/general/cvoid-19-wfh-companies-list-canada.html">2. Canada</a></h4>
<h4 id="3-india"><a href="/general/cvoid-19-wfh-companies-list-india.html">3. India</a></h4>
<p><em>Date: March 15 - March 17, 2020</em></p>
<p>You can help by filling out a completely <strong><a href="https://forms.gle/vtYCjuspqX8N3PEN7">anonymous survey</a></strong> to track companies, not just in the US but around the world, and their remote work policies.</p>
<h3 id="about-this-data">About this data</h3>
<p>The list is generated using reviews submitted by CodeAhoy readers. You’re free to use this data
under the terms of <strong><a href="https://creativecommons.org/licenses/by-sa/4.0/">cc-by-sa 4.0</a></strong> with attribution required. A link to codeahoy.com will be sufficient.</p>
<p>If you’re an official from the company listed below and the information isn’t accurate or the policy has been updated, please email us at: <em>name_of_this_blog</em>@gmail.com or message us on <a href="https://twitter.com/codeahoy">Twitter</a>.</p>
<p>(Fyi, <em>name_of_this_blog</em> = codeahoy)</p>
COVID-19 Hiring Freeze - Company List2020-03-15T00:00:00+00:00https://codeahoy.com/2020/03/15/covid-19-hiring-freeze
<p>Coronavirus disease (COVID-19) is adding a lot of uncertainty and anxiety for millions of people. Events left, right, and center are being cancelled. Markets are very volatile and stocks are dropping sharply.</p>
<p>Several companies have initiated hiring freezes. Some have frozen hiring for all positions while others have put a freeze in place only for non-essential positions. Millions of people, especially those on the job market, will be affected by this. However, we at CodeAhoy believe that once this is over, there is a strong possibility that markets will bounce back and there will be a <strong>hiring frenzy</strong>.</p>
<p>To track hiring freezes across companies all over the world, we are asking our users to help build a list of companies that have put a freeze on hiring.</p>
<h4 id="contribute-to-this-list-httpsformsglexeog7lvysd2h7wqa9">Contribute to this list: <a href="https://forms.gle/Xeog7LvySd2H7WQA9">https://forms.gle/Xeog7LvySd2H7WQA9</a></h4>
<p>This list was generated by reviews submitted by the users of CodeAhoy.com</p>
<script>
function myFunction(rand_int1 ) {
var dots = document.getElementById("dots-" + rand_int1 );
var moreText = document.getElementById("more-"+ rand_int1);
var btnText = document.getElementById("myBtn-"+ rand_int1);
if (dots.style.display === "none") {
dots.style.display = "inline";
btnText.innerHTML = "Show More >";
moreText.style.display = "none";
} else {
dots.style.display = "none";
btnText.innerHTML = "Show Less";
moreText.style.display = "inline";
}
}
</script>
<table class="table table-bordered table-hover table-condensed">
<colgroup>
<col span="1" style="width: 23%;" />
<col span="1" style="width: 25%;" />
<col span="1" style="width: 12%;" />
<col span="1" style="width: 37%;" />
</colgroup>
<thead><tr>
<th title="Field #1">Company Name</th>
<th title="Field #2">Official Policy</th>
<th title="Field #3">Country</th>
<th title="Field #6">Comments</th>
</tr></thead>
<tbody><tr><td> Amazon</td> <td> No </td> <td>USA</td> <td><div class="d-flex flex-column"><div class="ca_empty"><small class="text-muted">Current Employee - </small> </div><div class="mt-1">-</div></div></td></tr>
<tr><td> Apple</td> <td> No </td> <td>USA</td> <td><div class="d-flex flex-column"><div class="ca_empty"><small class="text-muted">Current Employee - </small> </div><div class="mt-1">Apple is still hiring. No communication on freeze</div></div></td></tr>
<tr><td> Atlassian</td> <td> No </td> <td>USA</td> <td><div class="d-flex flex-column"><div class="ca_empty"><small class="text-muted">Current Employee - </small> </div><div class="mt-1">Not an employee but just got booked for an interview</div></div></td></tr>
<tr><td> Boeing</td> <td> Total hiring freeze </td> <td>USA</td> <td><div class="d-flex flex-column"><div class="ca_empty"><small class="text-muted">Current Employee - </small> </div><div class="mt-1">-</div></div></td></tr>
<tr><td> Capital One</td> <td> No </td> <td>US</td> <td><div class="d-flex flex-column"><div class="ca_empty"><small class="text-muted">Current Employee - </small> </div><div class="mt-1">Summer intern 2020</div></div></td></tr>
<tr><td> Citizen App</td> <td> Partial freeze on some roles </td> <td>USA</td> <td><div class="d-flex flex-column"><div class="ca_empty"><small class="text-muted">Current Employee - </small> </div><div class="mt-1">-</div></div></td></tr>
<tr><td> Google</td> <td> No </td> <td>USA</td> <td><div class="d-flex flex-column"><div class="ca_empty"><small class="text-muted">Current Employee - </small> </div><div class="mt-1">-</div></div></td></tr>
<tr><td> Heap</td> <td> No </td> <td>USA</td> <td><div class="d-flex flex-column"><div class="ca_empty"><small class="text-muted">Current Employee - </small> </div><div class="mt-1">-</div></div></td></tr>
<tr><td> IBM</td> <td> No </td> <td>Canada</td> <td><div class="d-flex flex-column"><div class="ca_empty"><small class="text-muted">Current Employee - </small> </div><div class="mt-1">-</div></div></td></tr>
<tr><td> Informa</td> <td> Partial freeze on some roles </td> <td>USA</td> <td><div class="d-flex flex-column"><div class="ca_empty"><small class="text-muted">Current Employee - </small> </div><div class="mt-1">All hiring needs to be approved by ceo and a special committee</div></div></td></tr>
<tr><td> Microsoft</td> <td> No </td> <td>USA</td> <td><div class="d-flex flex-column"><div class="ca_empty"><small class="text-muted">Current Employee - </small> </div><div class="mt-1">-</div></div></td></tr>
<tr><td> Nordstrom</td> <td> No </td> <td>USA</td> <td><div class="d-flex flex-column"><div class="ca_empty"><small class="text-muted">Current Employee - </small> </div><div class="mt-1">-</div></div></td></tr>
<tr><td> Reonomy</td> <td> Total hiring freeze </td> <td>USA</td> <td><div class="d-flex flex-column"><div class="ca_empty"><small class="text-muted">Current Employee - </small> </div><div class="mt-1">-</div></div></td></tr>
<tr><td> Sezzle</td> <td> Total hiring freeze </td> <td>Usa</td> <td><div class="d-flex flex-column"><div class="ca_empty"><small class="text-muted">Current Employee - </small> </div><div class="mt-1">-</div></div></td></tr>
<tr><td> Skyscanner</td> <td> Total hiring freeze </td> <td>Austria</td> <td><div class="d-flex flex-column"><div class="ca_empty"><small class="text-muted">Current Employee - </small> </div><div class="mt-1">-</div></div></td></tr>
<tr><td> SpotOn</td> <td> Total hiring freeze </td> <td>usa</td> <td><div class="d-flex flex-column"><div class="ca_empty"><small class="text-muted">Current Employee - </small> </div><div class="mt-1">-</div></div></td></tr>
<tr><td> Starbucks</td> <td> Total hiring freeze </td> <td>Canada</td> <td><div class="d-flex flex-column"><div class="ca_empty"><small class="text-muted">Current Employee - </small> </div><div class="mt-1">Last week I was offered a position and set up a time to begin orientation. Last night I received a <span id="dots-84354417">...</span><span class="more" id="more-84354417">call from that same store manager advising that while the position is still mine they are freezing the hiring process for the time being as per instruction from Starbucks corporate.</span><small><span class="ca_btn_link_no_padding" onclick="myFunction(84354417)" id="myBtn-84354417">Show More ></span></small></div></div></td></tr>
<tr><td> TripAdvisor</td> <td> Partial freeze on some roles </td> <td>USA</td> <td><div class="d-flex flex-column"><div class="ca_empty"><small class="text-muted">Current Employee - </small> </div><div class="mt-1">-</div></div></td></tr>
<tr><td> Value Village</td> <td> Partial freeze on some roles </td> <td>Canada</td> <td><div class="d-flex flex-column"><div class="ca_empty"><small class="text-muted">Current Employee - </small> </div><div class="mt-1">-</div></div></td></tr>
<tr><td> Walgreens</td> <td> No </td> <td>USA</td> <td><div class="d-flex flex-column"><div class="ca_empty"><small class="text-muted">Current Employee - </small> </div><div class="mt-1">Walgreens is currently hiring full time, part time, and temporary CSAs, Pharmacy Techs, Shift Leads, and Beauty Consultants</div></div></td></tr>
<tr><td> Yelp</td> <td> Partial freeze on some roles </td> <td>USA</td> <td><div class="d-flex flex-column"><div class="ca_empty"><small class="text-muted">Current Employee - </small> </div><div class="mt-1">-</div></div></td></tr>
</tbody></table>
<div class="d-flex ca_margin_t_xl justify-content-between">
<!-- Share buttons -->
<div class="sharethis-inline-share-buttons"></div>
</div>
Tech Debt Developer Survey Results 2020 - Impact on Retention2020-02-17T00:00:00+00:00https://codeahoy.com/2020/02/17/technical-debt-survey<p>Last month, I wrote a blog post called <a href="https://codeahoy.com/2020/01/25/technical-debt/">Technical Debt is Soul-crushing</a>. In it, I discussed the effects of tech debt on software developers and how it makes them unhappy. I wrote it because, being in management for several years now, I have seen how people discuss tech debt as a product or engineering problem. They completely overlook or ignore its impact on people.</p>
<p>For software developers, it is very frustrating to work on a codebase with a high amount of tech debt. They feel unproductive and handicapped. This creates an atmosphere where people start thinking about leaving, and they do as soon as they find a better option.</p>
<p>Because the blog post was getting a good amount of traffic (10k views in a week), I decided to add a survey. The survey is now closed and the results are in.</p>
<h3 id="68-of-developers-said-they-work-on-products-with-high-or-very-high-amounts-of-tech-debt">68% of Developers Said They Work on Products with High or Very High Amounts of Tech Debt</h3>
<p>No large software system is free of tech debt. At least, I haven’t come across one in my life. Some have more, others have less.</p>
<p>Not a single person said that their product contains ‘No tech debt’.</p>
<p><img src="https://codeahoy.com/img/tech-debt-survey/howmuch.jpeg" alt="tech_debt_amount_in_software_systems" class="center-image" /></p>
<hr />
<h3 id="50-of-developers-are-likely-or-very-likely-to-leave-their-jobs-because-of-tech-debt">50% of Developers Are Likely or ‘Very Likely’ to Leave Their Jobs Because of Tech Debt</h3>
<p>27% indicated that they think about it, but aren’t sure.</p>
<p><img src="https://codeahoy.com/img/tech-debt-survey/leave_employer.jpeg" alt="Tech_Debt_Leave_Employer" class="center-image" /></p>
<hr />
<h3 id="question-is-your-management-aware-of-tech-debt-and-are-they-taking-action-to-pay-it-off">Question: Is Your Management Aware of Tech Debt and Are They Taking Action to Pay It Off?</h3>
<p>I asked this question to see if there’s a correlation between developer dissatisfaction and management or leadership being aware of the problem and taking action to pay it off. Here are the results.</p>
<p><img src="https://codeahoy.com/img/tech-debt-survey/managementaware.jpeg" alt="Tech_Debt_Management_Awareness" class="center-image" /></p>
<hr />
<h3 id="question-how-long-have-you-been-working-on-the-product-with-tech-debt">Question: How Long Have You Been Working on the Product with Tech Debt?</h3>
<p>Here are the results.</p>
<p><img src="https://codeahoy.com/img/tech-debt-survey/howlong.jpeg" alt="Tech_Debt_Years_Retention" class="center-image" /></p>
<hr />
<h3 id="survey-methodology">Survey Methodology</h3>
<p><strong>117 software developers</strong> from all over the world took the survey. The majority were from the USA, followed by Canada, Australia, Germany, India, Russia, the UK, and other European countries.</p>
<p>The survey ran in February 2020.</p>
<p>91 software developers took the web survey. 26 respondents were from my personal contacts: Senior or Lead software developers, mainly from the USA and Canada, who use Java. I limited the number of personal contacts so as not to introduce sample selection bias.</p>
<hr />
<p>Before I continue further and look at some comments from the survey, if you enjoyed this post, please <strong>share</strong> it on Twitter, Facebook and other sites. It helps me grow. The links to share are at the bottom.</p>
<h3 id="types-of-tech-debt">Types of Tech Debt</h3>
<p>I asked survey participants to identify the type of tech debt present in their project. Here are some of their <strong>responses</strong>.</p>
<ul>
<li>Poor code, outdated libraries, lack of documentation</li>
<li>Outdated libraries, a mix of design patterns, poorly designed code, obviously-slower-than-it-should-be performance</li>
<li>bloated monolith - lack of processes, no time allocated for reducing tech debt - lack of documentation, onboarding is not trivial - low amounts of automated testing - compiler warnings mostly ignored - static analysis tools not included in the dev process. running such tools on the codebase produce non-trivial warnings (that are not related to readability)</li>
<li>Poor code, bad design choices, lack of unit tests, no documentation, haphazard architecture</li>
<li>Outdated</li>
<li>no design no docs no tests no separation to concerns no competence</li>
<li>legacy code base</li>
<li>Poor design choices as the product evolved, mostly from a technical perspective dealing with the divide/chasm between frontend & backend, where and how is the data being transmitted.</li>
<li>poor coding practices, outdated libraries (largely from silly backwards compatibility requirements)</li>
<li>Monolithic processes, lack of documentation, inconsistencies across similar parts of the product, lack of shared code leading to mass code duplication</li>
<li>tech debt helps me keep my job</li>
<li>Poor code & design choices, outdated libraries, barely maintained logging & monitoring solutions, lack of documentation, terrible deployment strategy, lack of tests (especially integration tests)</li>
<li>Large monolith, lack of continuous delivery, vague processes and no central authority for organizing deployments of the monolith, consistent failure to investment in hiring enough people to support the work</li>
<li>Design docs written without background research, some design decisions made without a design doc</li>
<li>Zero tests, no documentation, massive code duplication, hardcoded IDs in code as hacks</li>
<li>Inconsistent use of designs/patterns across code base. Lacking alignment to broader architectural goals.</li>
<li>monolith, not implemented design principles (especially SRP and OCP from SOLID), outdated libraries, lack of process, lack of documentation</li>
<li>Bloated monolith, cowboy programming methods, hard-coded stuff everywhere, outdated programming languages and frameworks, lack of process</li>
<li>outdate languages and frameworks, monolithic structure, coupling between components that should not be connected</li>
<li>Building an in-house UI framework built on another external framework that’s not supported anymore.</li>
<li>bloated monolith, poor code or design choices, outdated libraries - terraform, lack of processes or documentation, Left over AWS infrastructure for services not used.</li>
<li>in-house frameworks flaky tests and code (sic!) unstable test hardware (the setup is done poorly), often redesigning code that wasn’t even updated to the previous design</li>
<li>Bloated monolith, poor code and design choices, far too much “duct tape” accumulated over decades.</li>
<li>lack of documentation, lots of invalid inconsistent data in the database, lots of code which is (presumably) not used anymore with no easy way to be sure.</li>
<li>Framework built 10 years ago in house</li>
<li>Poor abstractions. Architecture requiring substantial boilerplate.</li>
<li>Complicated event & callback system, varying programming styles, lack or processes.</li>
<li>lack of processes or documentation, poor SE practices, no code review</li>
</ul>
<hr />
<h3 id="other-comments-from-the-survey">Other Comments from the Survey</h3>
<p>I asked readers to leave comments on the survey. Here are some of their comments.</p>
<ul>
<li>Management actively acknowledges it and we even prioritize among technical debt, but <strong>the technical debt items never make it into work streams</strong> because it’s never seen as important enough.</li>
<li>I actually left because of their inability to make a choice</li>
<li>Senior developers are the problem. They added tech debt because it is their <em>job security</em></li>
<li>it’s hard to tackle the tech debt when most of the engineers are not very good. they simply don’t know how to write better code, and don’t always see the virtue in better designs when they’re presented. they also don’t know their tools very well (git rebasing and automated refactoring are foreign to them).</li>
<li>One of the curious things I’ve noticed is that our company is struggling financially, which means addressing tech debt is de-prioritized over new features that might bring in more revenue. However, I think a lot of our issues are related to our tech debt (such as systems being degraded over the weekend without us noticing due to poor monitoring). For whatever reason, new features keeps winning over fixing the issues that cost us users.</li>
<li>Old services and endpoints are shut down when no longer needed; logging integration APIs record actual production usage.</li>
<li>Automated tools help find dead code and refactor, but doesn’t write documentation</li>
<li>I consider tech debt a challenge - I have not been in the project since the beginning, but I do believe we can solve (maybe without going too much in the direction of your other article “overhours”)</li>
<li>Upper management is very aware of the crippling technical debt load our primary product contains, as it leads directly to numerous bugs – typically found by customers – that are very difficult to track down and fix. However, the immediate dev management doesn’t see that there’s a problem, so nothing is being done to rectify the issue.</li>
<li>Product features take priority and management is clueless how to fix it</li>
</ul>
<p>Here are some comments I liked from other sites.</p>
<blockquote>
<p>The system I’m working on is indeed <strong>soul-crushingly</strong> burdened. Everyone knows this: there’s a survey every year and “crushing technical debt” is always the <strong>top complaint</strong>. I mention it to the <strong>bosses</strong>, and they go “yeah, we know.” I say <strong>we never seem to schedule time to fix it</strong>, and I’m told “you’re not in the same meetings we are.” Yeah, but I’m the one that would be fixing it, if we were fixing it, eh?</p>
<p>Now I’ve started actually dealing with the management by treating the debt like it’s impactful.</p>
<p>“How long will it take to do X?” “I can’t predict. The thing that should have taken a day last time took three weeks.”</p>
<p>“Why did the system break?” “Shitty architecture from the start mandated making changes that are impossible to test, and our tests are so flakey it’s impossible to tell whether anything’s broken, so we pushed it out to prod and waited to see who complained about which feature.”</p>
<p>Without something like that, management never actually feels the pain, never understands that it’s not just something hard they’re asking for, but an interlocking tangle of shortcuts that make things 100x as hard to accomplish as they should be.</p>
</blockquote>
<blockquote>
<p>Oh, I understand this so much. I am a data analyst who does a lot of programming. We automate Excel spreadsheets and database tools and stuff using VBA and Python and SQL. The guy before me was not a programmer at all, as is often the case with people who develop automated Excel spreadsheets with VBA. Every tool that the client has asked me to update is so horrible that I can’t even trace through the code to figure out where I can insert a feature. It’s like a puzzle that has no logical solution.</p>
<p>Some of the code has variables that are sprinkled all over that have the same name and then he just repurposes them with new names. I swear all over the code There are blocks that look like this:</p>
<p>A=1</p>
<p>B=A</p>
<p>C=B</p>
<p>Instead of using modular functions and subroutines, every single function is repeated over and over with small modifications made. Because of that the code base is probably 30 times longer than it should be.</p>
<p>The worst part is throughout the code there are comments saying this should be fixed or that should be made better, but it was never done.</p>
<p>I like my job for several reasons, but this code I have to work with causes me to think about <strong>quitting</strong> probably every month or two.</p>
</blockquote>
Portal Theme and Blog Redesign2020-02-15T00:00:00+00:00https://codeahoy.com/2020/02/15/portal-jekyll-theme<p>The CodeAhoy website has been redesigned!</p>
<p>When I launched CodeAhoy in 2016, I wanted to keep things simple and clean. My previous blog on Wordpress had become a cluttered mess and was really slow, especially on mobile devices. I was also bored with Wordpress and wanted to try something different.</p>
<p>Then I stumbled upon <a href="https://jekyllrb.com/" rel="nofollow">Jekyll</a>. For those of you who aren’t familiar with it, Jekyll is a super easy static site generator for blogs. It takes your markdown files (and some configuration) and spits out a complete HTML website. The icing on the cake was that GitHub Pages supported Jekyll as one of the blog engines, which makes sense because Jekyll is written by one of the <a href="https://en.wikipedia.org/wiki/Tom_Preston-Werner" rel="nofollow">co-founders</a> of GitHub.</p>
<p>So it was a no brainer. I chose Jekyll.</p>
<p>The next thing I needed was a <strong>theme</strong>. Jekyll didn’t have the same number and variety of themes and plugins available as Wordpress. But I did end up finding a <a href="https://github.com/poole/hyde" rel="nofollow">theme that I liked</a>. The theme itself was a pretty simple <strong>two-column</strong> layout: one column for the left sidebar, which always stayed in place, and a second for the main content, i.e., the blog post. It was clean, and did I already mention simple?</p>
<p>After applying the theme and releasing my blog, this was the end result.</p>
<p><img src="https://codeahoy.com/img/screenshot-old-blog.jpg" alt="old_blog_screenshot" class="center-image" /></p>
<p>I liked it because of the focus on readability. But over time, my affinity for the theme was reduced to a glimmer. What did I not like about the design? Well, for starters, it <strong>wasn’t mobile-friendly</strong>. (Or maybe it became so after my tweaks to the CSS… I don’t remember.) On mobile devices, the sidebar moved to the top of the page, taking up half the screen real estate.</p>
<p>The second challenge, once I had a good number of posts on my blog, was that there was no room to display meta content like new or related blog posts and ads on the page. I was not too fond of the idea of embedding these things in the blog post itself. What I wanted was a thin right sidebar to display them. I made up my mind to change the layout but got busy with other things, so the idea collected dust.</p>
<p>Finally, I decided it was time to revamp the design. It was beginning to look pretty ancient, and mobile usability was just horrible. So on the President’s Day long weekend, I made it my goal to get it done. My requirements for the new design were:</p>
<ul>
<li>Mobile-first.</li>
<li>Keep the focus on readability; put content first.</li>
<li>Less cruft for improved page-load speeds.</li>
<li>For blog posts, a layout with a top navbar and a two-column main area: a much bigger left column for the blog post itself, and a smaller right column for meta content like featured posts.</li>
</ul>
<p>With the new requirements in mind, I set off to find the right Jekyll theme. I imagined that after so many years and improved Jekyll adoption courtesy of GitHub, I’d have better luck at finding the right theme quickly. I was wrong.</p>
<p>I searched for many hours. I looked at both paid and free themes. I couldn’t find anything that met my criteria. Just as I was contemplating giving up and building my own theme from scratch with Bootstrap, I decided to go through my shortlist of candidates one more time. On second look, a Medium-styled theme called <a href="https://github.com/wowthemesnet/mundana-theme-jekyll" rel="nofollow">Mundana</a> really stood out. The design was clean and very close to what I wanted. It was missing the right sidebar I needed, and it had some cruft that I wanted to get out. I also wasn’t fond of its animations and transitions. I checked its license, and since it was MIT, I made up my mind to tweak it to meet my requirements.</p>
<p>Plug: Mundana is built by <a href="https://www.wowthemes.net/" rel="nofollow">WowThemes</a>. These guys have some really cool Jekyll (and Wordpress) themes. Check them out.</p>
<p>Back to the redesign story. Actually, I’ll just give you the summary: it had been a while since I last used Bootstrap, which I didn’t account for when I thought I’d be able to finish the redesign in a day. It ended up taking twice as long, and I’m still not done. Thankfully, I remembered how <a href="https://developer.mozilla.org/en-US/docs/Learn/CSS/CSS_layout/Flexbox" rel="nofollow">flexbox</a>es work, and that was my saving grace. This is what the final result looked like:</p>
<p><img src="https://codeahoy.com/img/blog-post-2-col-layout.jpg" alt="new_blog_screenshot" class="center-image" /></p>
<p>Am I happy with it? Not really. It doesn’t look bad, but it can be improved. The Jumbotron displaying the blog post title, author and featured image doesn’t feel right, and it is something I plan on fixing next. I also want to improve the theme’s performance by removing cruft like jQuery and other CSS/JS files I don’t need, and by using <code class="language-plaintext highlighter-rouge">async</code> and <code class="language-plaintext highlighter-rouge">defer</code> for JS files to improve first paint times. I also want the navbar to be sticky on mobile. A lot of work, but I’ll chip away at it.</p>
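<p>For reference, this is what deferred script loading looks like in HTML; the file names below are hypothetical placeholders, not the theme’s actual assets:</p>

```html
<!-- A minimal sketch (hypothetical file names). `defer` downloads in parallel
     and runs scripts in document order after HTML parsing completes;
     `async` runs each script as soon as it finishes downloading,
     in no guaranteed order. -->
<script defer src="/assets/js/theme.js"></script>
<script async src="/assets/js/analytics.js"></script>
```

<p>Either attribute stops the script from blocking the HTML parser, which is what improves first paint times; <code class="language-plaintext highlighter-rouge">defer</code> is the safer default when scripts depend on each other or on the DOM.</p>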
<p>I’m making my changes available for free for anyone to use. If you want to contribute, you’re welcome to do so! If you have any questions about the theme, feel free to message me (<a href="https://twitter.com/codeahoy">Twitter</a>).</p>
<p>World, meet <strong>Portal</strong>.</p>
<h2 id="portal-jekyll-theme">Portal Jekyll Theme</h2>
<p>Portal is a mobile-first Jekyll theme for technical blogs and websites. The source code is available at: <a href="https://github.com/umermansoor/Portal-Jekyll-Theme">https://github.com/umermansoor/Portal-Jekyll-Theme</a></p>
<h3 id="features">Features</h3>
<ul>
<li>Modified Bootstrap 4.1 theme</li>
<li>Complete Jekyll integration</li>
<li>About Me and Author pages</li>
<li>Sitemap, Feed and Atom</li>
<li>robots.txt</li>
<li>Easy mailchimp integration for newsletter sign-ups</li>
<li>Easy Disqus integration</li>
</ul>
<h3 id="how-to-use">How to use</h3>
<p>These instructions assume you have Jekyll installed on your machine.</p>
<ol>
<li>Clone the git repo</li>
</ol>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/umermansoor/Portal-Jekyll-Theme.git
</code></pre></div></div>
<ol start="2">
<li>Change into the theme directory and serve the theme with the sample blog content</li>
</ol>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cd Portal-Jekyll-Theme
jekyll serve --watch
</code></pre></div></div>
<p>If everything works as expected, you should be able to see it by visiting <code class="language-plaintext highlighter-rouge">http://127.0.0.1:4000/index.html</code>. I have added some sample blog posts and content to illustrate the design.</p>
Technical Debt is Soul-crushing2020-01-25T00:00:00+00:00https://codeahoy.com/2020/01/25/technical-debt<p>Technical debt is incurred when the software or system designers take <strong>shortcuts</strong> to ship a feature faster, increasing the overall complexity of the system. The goal is to optimize the present rather than the future. In other words, it’s the easier path that takes us to the end-goal faster, but the resulting code (or design) is messy and complicated. It will require extra time in the future to add new features or to fix bugs.</p>
<p>The most common reason why companies take on technical debt is to meet the time to market demands. “We must release this feature by February, or our revenue will take a big hit. Just hack it for now, and we’ll fix it later.” Other reasons for incurring tech debt include lousy design choices, poor programming, changing requirements, or the presence of outdated libraries or frameworks that made sense in the past but have become a liability now.</p>
<p>Technical debt is <em>not</em> always a bad thing. It can help a company ship a critical feature fast and acquire users more quickly than its competition. My first job was at a startup. We intentionally took on tech debt because a) what we were doing was risky, and b) we had a tight deadline to meet or the company would run out of money - no point in writing perfect code if it was never going to be released. Our tech debt was, in a sense, the opposite of <a href="https://codeahoy.com/2017/08/19/yagni-cargo-cult-and-overengineering-the-planes-wont-land-just-because-you-built-a-runway-in-your-backyard/">over-engineering</a>: an intentional compromise to get the product out of the door on time. We understood that we’d have to pay the debt off or it would make the system difficult to maintain and grow in the future.</p>
<!--more-->
<h2 id="tech-debt-is-demoralizing-for-software-developers">Tech Debt is Demoralizing for Software Developers</h2>
<p>The problem starts when companies forget to pay off the debt and let it creep and pile up for an extended period. The past comes back to haunt the present. For good software developers, it is totally <strong>demoralizing</strong> to work on products that have high tech debt. This aspect isn’t often talked about, but its effects are very real. Simple things like changing the title tag of a webpage take up a whole day because the logic is scattered across five different files. At the end of the day, it’s not a great feeling that a small task took so much time. It’s even more upsetting when they have to explain to their managers, colleagues, or the product team why it took so long. Troubleshooting a bug is not just difficult but also painful. Jeff Atwood called it a <strong><a href="https://blog.codinghorror.com/paying-down-your-technical-debt/">major disincentive</a></strong> to work on a project:</p>
<blockquote>
<p>Beyond what Steve describes here, I’d also argue that accumulated technical debt becomes a major disincentive to work on a project. It’s a collection of small but annoying things that you have to deal with every time you sit down to write code. But it’s exactly these small annoyances, this sand grinding away in the gears of your workday, that eventually causes you to stop enjoying the project.</p>
</blockquote>
<p><img src="https://codeahoy.com/img/tech-debt-dev.jpeg" alt="tech_debt_human" class="center-image" /></p>
<p>I’ll add that it is even more frustrating if it is a large project with many other developers working on it. Imagine finally finishing a feature only to find that it is not working on the QA or Stage environment because another developer from a different team changed something which somehow broke the feature you just completed and you both have no idea how to get out of the mess. It can also fester and promote intra-team conflicts and dissatisfaction.</p>
<p>In addition to just being frustrating, good software developers have a burning desire to keep their skills up-to-date. Systems with high tech debt are very difficult to work with, much less keep up to date. As a result, developers are stuck with libraries and frameworks that are many years old. They fear they will break something, so they avoid upgrading it altogether.</p>
<h2 id="cost-of-tech-debt-employee-turnover">Cost of Tech Debt: Employee Turnover</h2>
<p>The real cost of all this is turnover. Good developers leave when they believe the product is going nowhere. There is no sense of accomplishment, just frustration. There is worry about the future, since they aren’t learning anything new or acquiring new skills. They leave when they find a better opportunity, leaving behind those who are at peace with taking shortcuts or are stuck working on the crummy project.</p>
<p>The people who are left behind do not voice their concerns as vocally as those who left and they continue increasing the tech debt. They have learned to survive with the beast and are at peace, taking the <em>quick-and-dirty</em> path all the time. One could also argue that the messy codebase encourages further poor practices and software rot owing to the <strong><a href="https://codeahoy.com/2016/05/02/software-rot-entropy-and-the-broken-window-theory/">broken window theory</a></strong>.</p>
<p>I’d argue that for organizations with high tech debt, it is more expensive to replace software developers than it is for those with a lower amount of debt. When someone leaves, they take a chunk of tribal knowledge about the code and processes with them. New hires take a long time to understand the system and get up to speed, and a long time before they become productive.</p>
<h2 id="how-to-fix-it">How to fix it?</h2>
<p>Good leaders and managers must evaluate tech debt regularly and take proactive steps to keep it under control. The worst thing senior technical leadership can do is deny the existence or effects of tech debt when it is present. The second worst thing is to <em>ignore</em> it and take no action when action is needed. Those who do are being short-sighted because they will soon be drowning in problems (if they stay).</p>
<p>Let’s look at some <strong>strategies for paying off the debt</strong>. This is in no way a comprehensive list; it is based on my personal experiences.</p>
<p>A common strategy that I have seen work is setting aside a percentage of time each quarter or sprint to gradually pay off tech debt along with new feature development. Anywhere from 15% to 30% is reasonable, depending on how bad it is. This strategy works great if the tech debt isn’t out of control.</p>
<p>If the debt is out of control, then it must be attacked head-on and prioritized. A good first step is to acknowledge the problem and let the software team know that it will be addressed. For large or medium organizations, it is important that all senior leaders are on board. For example, if the product or marketing team doesn’t see the value or believe in the effort, they’ll continue pushing for new features, creating a poor dynamic that will make it fail. The debt metaphor works really well on non-technical people and helps them understand the problem. After buy-in across the board, invite senior engineers to come up with a solution.</p>
<ol>
<li>Start by identifying tech debt in your system or code. Resist the urge to define solutions at this stage. Create a shared spreadsheet identifying classes, methods, entire modules or services, use cases, dead feature flags or A/B tests, etc. in your code.</li>
<li>Go through the items that are identified above and extract common themes. Prioritize these themes such that most painful ones are taken care of first. Define clear goals. The goal shouldn’t be to pay off all tech debt but to achieve a reasonable ratio. You can choose to leave some of it in there or until the next phase.</li>
<li>Discuss steps but avoid the urge to <strong><a href="https://codeahoy.com/2016/04/21/when-to-rewrite-from-scratch-autopsy-of-a-failed-software/">rewrite from scratch</a></strong>. Inexperienced developers love total rewrites. Resist the urge and instead identify incremental improvements.</li>
<li>Be open to retiring systems that aren’t useful or needed anymore. When a system is retired, all of its technical debt is retired with it. At my previous (mobile gaming) company, we had a few very old central services (backend) that had a ton of tech debt and troubleshooting them every now and then was very time-consuming. These services were used by older game clients and had become irrelevant for business purposes. In theory, we could shut these down without any impact. But it wasn’t technically possible because doing so would break the API calls and impact the game and the user (forced updates of game clients wasn’t an option), so we came up with a creative way. We retired these services and replaced them with a new ‘mock’ service that always returned a mock (successful) response.</li>
<li>Assign action items to engineers and teams. Make sure they are empowered and that there is a clear execution plan.</li>
<li>Change existing habits. This might be a little harder, especially with engineers who have been with the company for a long time and are used to hacking things in even when it isn’t necessary. Coaching helps. Reiterate the debt-minimization vision and promote best practices, such as writing unit tests for all new features.</li>
<li>Devise a strategy to <em>track</em> new tech debt. An easy and common technique is placing <code class="language-plaintext highlighter-rouge">TODO</code> comments with a description e.g. <code class="language-plaintext highlighter-rouge">TODO - move this logic to XYZ</code> or <code class="language-plaintext highlighter-rouge">TODO: setup and use internationalized string instead of hardcoded</code></li>
<li>Make it visible and a priority. Follow up regularly in check-ins and re-adjust as required.</li>
</ol>
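<p>The “retire and mock” idea from step 4 can be sketched with nothing but Python’s standard library. This is a hypothetical, minimal stand-in (the response body is made up for illustration, not the actual service we built); it answers every path and verb with the same canned success payload that legacy clients expect:</p>

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Canned response mimicking the retired service's "success" payload
# (the field name here is made up for illustration).
MOCK_BODY = json.dumps({"status": "ok"}).encode()

class MockServiceHandler(BaseHTTPRequestHandler):
    def _respond(self):
        # Every request, regardless of path, gets the same 200 + JSON body.
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(MOCK_BODY)))
        self.end_headers()
        self.wfile.write(MOCK_BODY)

    # Legacy clients may use any verb; answer them all the same way.
    do_GET = do_POST = _respond

    def log_message(self, *args):
        pass  # keep the mock quiet

# To actually run it:
# HTTPServer(("", 8080), MockServiceHandler).serve_forever()
```

<p>Once the mock is serving traffic, the real service and all of its tech debt can be deleted while old clients keep receiving the successful responses they depend on.</p>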
<p>This process is an art rather than a science. The most important thing is to identify a path that allows tech debt to be paid off gradually, while also allowing new feature development and bug fixes to happen in parallel (albeit with some compromises.) Even the most complex systems can be gradually retired. I’ve found reverse-proxying to be an effective technique, in which some requests, or parts of requests, are redirected to newer services while the older services are slowly retired.</p>
<p>If you’re a leader, you must assess the tech debt situation in your organization. Early detection and prevention will go a long way. Don’t ignore it if it has gotten too big and your developers are complaining and stressed out. If the situation isn’t too bad, schedule time to pay it off periodically. If it is out of control, make it an immediate priority. You might get away with it in the short term, but it will come back to bite you and feature development will slow down to a snail’s pace.</p>
<p>I ran a survey to understand the impact of tech debt on employee retention. Please visit <a href="https://codeahoy.com/2020/02/17/technical-debt-survey/">this link to see the results</a>.</p>
Do Software Developers Normally Code on Weekends? Work-life Balance and Overtime in the Tech Industry2019-10-19T00:00:00+00:00https://codeahoy.com/2019/10/19/do-software-developers-work-weekends-work-life-tech<p>Despite all the perks tech companies offer to employees these days (unlimited PTOs, catered lunches, refreshments, beer on tap, dog friendly offices, gyms, yoga classes), there is often an indirect or hidden pressure on software engineers to put in more hours or to code on weekends. Managers say things like:</p>
<ul>
<li>“You can call me on the weekend or even at night if you have any questions. I will answer your phone because I’ll be working anyways.”</li>
<li>“Sam is an amazing engineer. So dedicated and such a hard worker. He couldn’t finish his tasks because he was blocked for two days by the other team, so he spent all weekend coding.”</li>
<li>“I’m so proud of my team. They regularly stayed late and worked weekends to deliver on our roadmap.”</li>
<li>“We’ll be using the new technology XYZ for this project. Why don’t you start learning XYZ in your ‘free time’ (aka the weekend)?”</li>
</ul>
<p>Subtle hints like these leave many developers wondering whether working on weekends is a normal thing that is expected of software engineers.</p>
<p>Which brings us to the <em>question</em>: <strong>“Is it OK to not code for work on the weekends?”</strong></p>
<!--more-->
<p>The <em>answer</em> is: <strong>it is ABSOLUTELY OK to not work on weekends</strong>. You have a life and it’s perfectly OK to enjoy it.</p>
<p>One exception to this rule is if there is a <strong>real emergency</strong>. No, not finishing a story in time for the Sprint Retro doesn’t constitute an emergency. It was probably not estimated correctly and besides, <a href="https://codeahoy.com/2016/05/14/software-estimates-are-not-targets/">estimates are not targets</a>. Emergencies should be <em>rare</em> or there are bigger issues with the team or the company. For example, a while ago, I was handling the launch of a product (a mobile game) that was attached to PR events. It was supposed to go live on Monday. On Friday afternoon before the launch, we discovered a major issue that was going to seriously jeopardize the launch if not fixed. This was a real emergency. The team came into work both on Saturday and Sunday. We fixed the issue just in time for launch and it went super smooth. The next week, the entire team took 2 extra days off and got a 4-day weekend. It was followed by a <a href="https://codeahoy.com/2016/06/20/blameless-postmortems-examining-failure-without-blame/">post-mortem analysis</a> to understand the root cause so we don’t make the same mistake again.</p>
<h3 id="isitoktolearnnewtechnologiesmyjobrequiresat-work">Is It OK to Learn New Technologies My Job Requires at Work?</h3>
<p>If your work requires you to learn a new technology, it’s quite OK to learn it at work. If you are required to <em>learn</em> it in order to apply it at your job, then by <strong>definition it is work</strong>. This is where most managers either turn a blind eye or encourage employees to learn on their <strong>‘free time’</strong>. If you’re in a situation like this, work with your manager to carve out dedicated time each sprint (or week, whatever) where you can spend time learning the new technology you’ve been asked to learn. This is not a black or white rule. If you’re really interested in learning the new technology and want to put in your own time because it is good for your personal development and growth, then by all means you should invest your free time in learning. The point I’m trying to make is that you <strong>should not be pressured</strong> into doing it. Similarly, if you really enjoy coding and want to take your skills to the next level by doing work on weekends, that’s totally fine as long as it is your choice.</p>
<h3 id="how-to-be-effective-at-work">How to be Effective at Work?</h3>
<p>If you don’t want to put in overtime, you must become <strong>efficient while at work</strong> and avoid distractions. As an engineer, if you are attending a lot of useless meetings each week, talk to your manager or the person who owns the meeting to see if you really need to be in those meetings. Try to avoid wasting time with people who come to work to <strong>socialize</strong> with others and then stay late to complete their tasks. <strong><a href="https://codeahoy.com/2016/05/28/why-do-developers-love-music-so-much/">Use headphones</a></strong>. If the socializers still don’t get the signal, walk away and tell them you’re busy. Work from home one day every week where you can focus distraction-free. Talk to your management team to implement <strong>no meeting days</strong> where, other than daily stand-ups, no other meetings are allowed (this will usually require support from executives if you are working for a large company because it needs to be understood and accepted company-wide.)</p>
<h3 id="the-myth-of-overtime">The Myth of Overtime</h3>
<p>Putting in extra hours <strong>does</strong> work in the short term and will get the job done. But it’s not a viable long term strategy. In fact, the effects of overtime either cancel out the progress made in the longer term or leave teams worse off than they were before. <a href="https://www.goodreads.com/book/show/67825.Peopleware">Peopleware</a>, one of my favourite management books of all time, covers this issue very nicely and uses the sprinting analogy: <strong>it’s great for the final stretch but if you do it for too long, you won’t be able to finish the race</strong>. Wise managers use it judiciously and think in terms of benefits divided by costs. The benefit could be that project is delivered on time. The cost will be downtime and decreased productivity or worse, replacing smart people who quit (which is almost never worth it!)</p>
<h3 id="what-about-workaholics">What About Workaholics?</h3>
<p>It’s totally true that some people are workaholics. If you ask them about work-life balance, don’t be surprised if you hear <em>‘What’s that?’</em> They’ll consistently put in very long hours even when there’s no pressure on them. Under pressure, they’d even sleep at the office. It might be hard to believe for some, but I know regular employees who worked on Christmas holidays and were at the office every single day because they wanted to deliver on their commitments for the quarter.</p>
<p><img src="https://codeahoy.com/img/worked_my_butt_off.jpeg" alt="Workaholic_Software_Engineer" class="center-image" /></p>
<p>But the productivity of workaholics declines after a while. Sooner or later, they’ll realize that life was passing them by while they sacrificed it for work, and that other people were enjoying it. They’ll become grumpy. Develop health issues. Burn out. Mentally check out or quit.</p>
<p>Chapter 3 of Peopleware, “Vienna waits for you” starts with the following tale:</p>
<blockquote>
<p>Some years ago I was swapping war stories with the manager of a large project in southern California. He began to relate the effect that his project and its crazy hours had had on his staff. There were two divorces that he could trace directly to the overtime his people were putting in, and one of his worker’s kids had gotten into some kind of trouble with drugs, probably because his father had been too busy for parenting during the past year. Finally there had been the nervous breakdown of the test team leader.</p>
<p>As he continued through these horrors, I began to realize that in his own strange way, <strong>the man was bragging</strong>. You might suspect that with another divorce or two and a suicide, the project would have been a complete success, at least in his eyes.</p>
</blockquote>
<p>If you’re an employee and your teammate is a workaholic, don’t feel bad. It doesn’t mean you have to do the same.</p>
<p>If you’re a manager, you need to keep an eye out to detect problems early and course-correct. I was talking to a friend who quit his job a couple of months ago. His manager would spend time at the office socializing and then regularly send emails on weekends cc’ing higher-ups asking for updates from the team. None of the manager’s bosses stepped in to say “what are you doing?” I’m glad he quit and is now in a place with better culture and working conditions. As a manager, I try to keep a close eye on overtime. My priority is to prevent burnout. There are times when we are running behind and it gets tempting to push the team to put in extra hours. But there are options 98% of the time: work with the product team to negotiate and reduce the scope so the project can be released on time. Find other creative solutions. And if nothing else is possible, find a way to make roadmap and OKR adjustments and push things out. 98% of the time, it’s not worth doing at the cost of burnout.</p>
<h2 id="a-case-against-workoholics">A Case Against Workaholics?</h2>
<p>I’m in no way suggesting that workaholics are bad people. There are times when companies need workaholics to succeed, especially startups in their infancy. But it’s not a scalable approach. <strong>If your company can only be run by workaholics, you’re in trouble.</strong></p>
<p>So it’s not so black and white. If you’re starting out or learning a new technology, it’s totally OK to put in extra hours initially so you can have it easier down the road. I’m not against that at all; this post is about companies that try to promote this culture and pressure employees into following it. That is not OK!</p>
<hr />
<p>If your manager tries to indirectly pressure you into regularly working weekends by saying things like “<em>you can call me on Saturday (or at midnight) if you have any questions, I’ll be there to answer since I’ll be working anyways</em>,” you can reply “<em>No, thanks. I don’t work on weekends as a principle.</em>” If that doesn’t work and your manager or peers continue to pressure you, start looking for a new job that offers better work-life balance. Seriously, it is not worth it. Don’t listen to the <strong><a href="https://www.businessinsider.com/jeff-bezo-advice-to-amazon-employees-dont-aim-for-work-life-balance-its-a-circle-2018-4">work-life harmony</a></strong> BS and <strong>don’t let corporations control and define your quality of life</strong>.</p>
<p>See you next time.</p>
GraphQL - A Practical Overview and Hands-On Tutorial2019-10-13T00:00:00+00:00https://codeahoy.com/2019/10/13/graphql-practical-tutorial<p>GraphQL is a <strong>Q</strong>uery <strong>L</strong>anguage for APIs. It provides a fresh and <em>modern</em> approach to fetching (and manipulating) data from APIs when compared to traditional methods such as REST. Its power comes from the ability to let clients talk to a single endpoint and specify precisely what data they need. That’s very powerful indeed.</p>
<p>This blog post is a hands-on introduction to GraphQL and its important features. When I first encountered GraphQL (we were switching our public REST APIs to GraphQL), coming from a REST background, I was baffled. ‘Where’s the list of API end-points?’, ‘Is the list of fields documented somewhere?’ So I decided to write this article not as a comprehensive overview that deep-dives into internals, but rather to give an understanding of what GraphQL is and <em>how to use it</em> through real examples. It assumes no previous knowledge of GraphQL. Let’s get started.</p>
<!--more-->
<h2 id="graphql---an-overview">GraphQL - An Overview</h2>
<p>GraphQL - or more specifically - the data query language specification and the runtime for it - was developed by Facebook. It was <em>open sourced</em> in <strong>2015</strong>, after a few years of internal use at Facebook. What inspired the need for GraphQL? Here’s an excerpt from the <a href="https://engineering.fb.com/core-data/graphql-a-data-query-language/">Facebook engineering blog</a>:</p>
<blockquote>
<p>As we transitioned to natively implemented models and views, we found ourselves for the first time needing an API data version of News Feed — which up until that point had only been delivered as HTML. We evaluated our options for delivering News Feed data to our mobile apps, including RESTful server resources <em>…</em> We were frustrated with the <strong>differences between the data we wanted to use in our apps and the server queries they required</strong>.</p>
</blockquote>
<p>Once you build and expose a REST API, it’s pretty rigid. For example, suppose we have built a News Feed RESTful API which is returning 10 attributes for each item in the news feed. Down the road, you are building a mobile app for low-tier devices that don’t have the screen real-estate to show the news feed in all its glory. So instead of showing all 10 attributes for each news feed item, you resort to showing just 5. This is <strong>wasteful</strong> because you’re still fetching all 10 attributes. The typical solution would be to go to the RESTful API development (backend) team and ask them to make the fields you are not using optional. But this is not straight-forward because there are other clients out there and you might introduce a breaking change. They could introduce a new version or do something else. But the point I’m trying to illustrate is that the process is not free of friction and is inefficient.</p>
<p>GraphQL solves this exact problem by putting a lot of power in the hands of client developers. The basic premise is that the clients can always describe what data they need. Continuing the Facebook engineering blog post:</p>
<blockquote>
<p>There was also a considerable amount of code to write on both the server to prepare the data and on the client to parse it. This frustration inspired a few of us to start the project that ultimately became GraphQL. GraphQL was our opportunity to rethink mobile app data-fetching from the perspective of product designers and developers. It moved the focus of development to the client apps, where designers and developers spend their time and attention.</p>
</blockquote>
<p>GraphQL has a strong community behind it that supports many languages including Java, Javascript, C#, Python, etc. It’s used by many companies and teams of all sizes including Facebook, Pinterest, GitHub, Yelp, and many others.</p>
<h3 id="playing-with-graphql-in-browser-graphiql">Playing with GraphQL in Browser: GraphiQL</h3>
<p>Let’s get hands-on and learn GraphQL by trying it out, thanks to the nice folks at GitHub who have made their GraphQL APIs public. What’s even better is that we don’t need to install any command line tools or do anything special. All you need is a web-browser and a free GitHub account. To run queries on GitHub’s GraphQL, we’ll be using a tool called <a href="https://github.com/graphql/graphiql">GraphiQL</a> (<a href="http://www.howtopronounce.cc/graphiql">pronunciation</a>). Think of it like <a href="https://swagger.io">Swagger</a> for RESTful APIs.</p>
<blockquote>
<p>A graphical interactive in-browser GraphQL IDE.</p>
</blockquote>
<p>Graph<em>i</em>QL is easy to integrate and is used a lot by teams when working with GraphQL in development and pre-production environments. I use it a lot to explore queries, find issues, etc. before moving on to implementation. Goes without saying, but you shouldn’t enable it on your public servers unless you’re exposing a public API like GitHub.</p>
<p>Alright, without further ado, let’s head over to GitHub’s Graph<em>i</em>QL endpoint and open it in a <strong>new tab</strong> so you can try out the examples. Here’s the link to the <strong><a href="https://developer.github.com/v4/explorer/">GitHub GraphQL Explorer</a></strong>. (They have customized the UI a bit but it’s Graph<em>i</em>QL under the hood.)</p>
<h3 id="first-graphql-query">First GraphQL Query</h3>
<p>Type the following query and hit the play button in the top right corner or press <code class="language-plaintext highlighter-rouge">Ctrl-Enter</code> to execute the query.</p>
<pre class="prettyprint">
{
viewer {
name
isEmployee
location
}
}
</pre>
<p><img src="https://codeahoy.com/img/graphql/graphiql-overview.png" alt="GitHub_GraphiQL" /></p>
<p>Congratulations! You just ran our first GraphQL query. The data GraphQL returned is in <strong>JSON format</strong> and has the same <strong>shape</strong> as the query. This is an important concept in GraphQL: the shape of the response (query result) closely matches the result. The syntax of the query is custom to GraphQL and corresponds to its schema which we’ll explore in the next section.</p>
<p>In the query we just ran, we asked the GraphQL server to return three fields (<code class="language-plaintext highlighter-rouge">name</code> etc.) for an object called <code class="language-plaintext highlighter-rouge">viewer</code> which represents your GitHub account. We could easily ask for more fields as shown in the example below where I added three new fields to the same query:</p>
<p><img src="https://codeahoy.com/img/graphql/graphiql-more-parameters.png" alt="GitHub_GraphiQL_Additional_Fields" /></p>
<p>Hint: Graph<em>i</em>QL will <strong>autocomplete</strong> fields as you start typing. Try out other parameters that are available.</p>
<h2 id="graphql-schema">GraphQL Schema</h2>
<p>The schema is the main part of any GraphQL implementation. Schemas are written using what’s known as the <strong>Schema Definition Language</strong> or SDL for short. SDL is human readable and while it might look like Javascript, it’s not. The syntax doesn’t correspond to any one programming language which makes it language-agnostic for good reasons. SDL describes all the fields, arguments, and functionality that is available to clients.</p>
<h4 id="fields">Fields</h4>
<p>Fields are the basic unit of data in GraphQL and the center of its universe. According to the <a href="https://graphql.github.io/learn/schema/">official documentation</a>:</p>
<blockquote>
<p>the GraphQL query language is basically about selecting fields on objects.</p>
</blockquote>
<p>Reviewing the query we just wrote:</p>
<pre class="prettyprint">
{
viewer {
name
isEmployee
location
}
}
</pre>
<ol>
<li>We start with the special “root” object.</li>
<li>We select an object <code class="language-plaintext highlighter-rouge">viewer</code> on the query, which represents your GitHub account.</li>
<li>For the <em>object</em> returned by <code class="language-plaintext highlighter-rouge">viewer</code>, we select the <code class="language-plaintext highlighter-rouge">name</code>, <code class="language-plaintext highlighter-rouge">isEmployee</code> and <code class="language-plaintext highlighter-rouge">location</code> fields to be returned.</li>
</ol>
<p>Fields have <strong>types</strong>. The following types are supported:</p>
<ul>
<li>Int</li>
<li>Float</li>
<li>String</li>
<li>Boolean</li>
<li>ID</li>
<li>Enum</li>
<li>List</li>
<li>Object (custom types)</li>
</ul>
<p>GitHub GraphQL implementation makes available the following <strong>objects</strong> that are commonly associated with GitHub. We can query these objects and ask GitHub to return fields that we need, as we did for the <code class="language-plaintext highlighter-rouge">viewer</code> object in the last example.</p>
<ul>
<li>Viewer</li>
<li>Repositories</li>
<li>Users</li>
<li>Organizations</li>
<li>Issues</li>
<li>etc.</li>
</ul>
<p>For more information on schema and the Schema Definition Language (SDL,) you can go to this <a href="https://www.apollographql.com/docs/apollo-server/schema/schema/">tutorial</a>.</p>
<h3 id="viewing-api-documentation">Viewing API Documentation</h3>
<p>Graph<em>i</em>QL makes it very easy to explore the schema to get information on what queries it supports. It comes with a built-in documentation tool called the <strong>Documentation Explorer</strong>, which shows you all the available types, fields, arguments, and more. It’s a pretty kick-ass feature and I like it a lot. To open the <strong>Document Explorer</strong>, <em>click</em> the <code class="language-plaintext highlighter-rouge">< Docs</code> icon, typically on the top left side. This will open a new pane. <em>Type</em> <code class="language-plaintext highlighter-rouge">repository</code> in the search field to explore what arguments it takes and which fields it returns.</p>
<p><img src="https://codeahoy.com/img/graphql/graphiql-schema.png" alt="GitHub_GraphiQL_Schema_Documentation" /></p>
<p>In addition to using the <strong>Document Explorer</strong> to explore the schema, you can also query the schema directly. This is useful if you are not using the Graph<em>i</em>QL interface and need an alternate way to look at the schema. The <code class="language-plaintext highlighter-rouge">__schema</code> and <code class="language-plaintext highlighter-rouge">__type</code> fields are used for this purpose. This is called <strong>introspection</strong> and you can read more about it <a href="https://graphql.org/learn/introspection/">here</a> if you are interested.</p>
<h2 id="graphql-query-arguments">GraphQL Query Arguments</h2>
<p>In the last couple of queries we ran, we just asked GraphQL to return some fields, e.g. <code class="language-plaintext highlighter-rouge">name</code> or <code class="language-plaintext highlighter-rouge">location</code>, without passing any arguments. It’s possible to pass arguments to a query. Let’s take a look at another query, one that requires arguments.</p>
<pre class="prettyprint">
{
repository(owner: "umermansoor", name: "microservices") {
name
nameWithOwner
description
# Nested field
stargazers {
totalCount
}
}
}
</pre>
<p><img src="https://codeahoy.com/img/graphql/graphiql-query-with-arguments.png" alt="GitHub_GraphiQL_With_Arguments" /></p>
<p>In the above query, we passed two arguments: <code class="language-plaintext highlighter-rouge">owner</code> and <code class="language-plaintext highlighter-rouge">name</code> to the <code class="language-plaintext highlighter-rouge">repository</code> object of type <code class="language-plaintext highlighter-rouge">Repository</code>. This object returns information about a GitHub repository. In this example, I passed my GitHub username <code class="language-plaintext highlighter-rouge">umermansoor</code> as the owner and <code class="language-plaintext highlighter-rouge">microservices</code> as the name of a repository which I own. (Fun fact: this repository is an example project I created when I was experimenting with Flask. You can read more about it <a href="https://codeahoy.com/2016/07/10/writing-microservices-in-python-using-flask/">here</a>.)</p>
<p>GraphQL arguments can be either required or optional. This is controlled in the schema, and you can easily see which arguments are required vs. optional. An argument whose type ends with an exclamation point (<code class="language-plaintext highlighter-rouge">!</code>) is required. Ones that do not end with an <code class="language-plaintext highlighter-rouge">!</code> are optional.</p>
<p><img src="https://codeahoy.com/img/graphql/graphql-arguments-required-optional.png" alt="Required_vs_Optional_Arguments" /></p>
<h2 id="multiple-queries-and-aliases">Multiple Queries and Aliases</h2>
<p>Let’s revisit our <code class="language-plaintext highlighter-rouge">repository</code> example and suppose we want to fetch information about <strong>two separate</strong> repositories at the same time. A naive approach would be to repeat the query twice. Let’s see what happens when we do that:</p>
<p><img src="https://codeahoy.com/img/graphql/graphql-multiple-query-error.png" alt="GitHub_GraphiQL_Multiple_Query_Error" /></p>
<p>It threw an error, basically saying that there’s an argument conflict. In GraphQL, we can’t query the same field with different arguments, which is what we did in the last example: we called the <code class="language-plaintext highlighter-rouge">repository</code> object (of type <code class="language-plaintext highlighter-rouge">Repository</code>) with two different sets of arguments:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">repository(owner: "umermansoor", name: "microservices")...</code></li>
<li><code class="language-plaintext highlighter-rouge">repository(owner: "graphql", name: "graphiql") ...</code></li>
</ul>
<p>This REST-like approach didn’t work. The good news is that there’s a really easy way to fetch both in one call. This is where <strong>aliases</strong> come in. They allow us to attach a custom name to each query. Here’s the failing query from above rewritten with aliases, which now works:</p>
<pre class="prettyprint">
{
pythonMicroservicesRepository: repository(owner: "umermansoor", name: "microservices") {
name
nameWithOwner
description
forkCount
}
graphqlRepository: repository(owner: "graphql", name: "graphiql") {
name
nameWithOwner
description
forkCount
}
}
</pre>
<p>This is the result that I get. Notice that even with aliases, the shape of the response closely resembles the query. The name <code class="language-plaintext highlighter-rouge">repository</code> in the response is replaced by the alias that we provided in the query:</p>
<pre class="prettyprint">
{
"data": {
"pythonMicroservicesRepository": {
"name": "microservices",
"nameWithOwner": "umermansoor/microservices",
"description": "Example of Microservices written using Flask.",
"forkCount": 199
},
"graphqlRepository": {
"name": "graphiql",
"nameWithOwner": "graphql/graphiql",
"description": "An in-browser IDE for exploring GraphQL.",
"forkCount": 890
}
}
}
</pre>
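<p>Because the aliases become the top-level keys under <code class="language-plaintext highlighter-rouge">data</code>, client code can pick each result out directly. A quick Python sketch, using the response body shown above:</p>

```python
import json

# Response body returned by the aliased query above.
response = json.loads("""
{
  "data": {
    "pythonMicroservicesRepository": {
      "name": "microservices",
      "nameWithOwner": "umermansoor/microservices",
      "description": "Example of Microservices written using Flask.",
      "forkCount": 199
    },
    "graphqlRepository": {
      "name": "graphiql",
      "nameWithOwner": "graphql/graphiql",
      "description": "An in-browser IDE for exploring GraphQL.",
      "forkCount": 890
    }
  }
}
""")

# The alias, not the underlying field name "repository", is the key.
data = response["data"]
print(data["pythonMicroservicesRepository"]["forkCount"])  # 199
print(data["graphqlRepository"]["forkCount"])              # 890
```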
<h3 id="fetching-related-objects-using-connections-edges-and-nodes">Fetching Related Objects using Connections, Edges and Nodes</h3>
<p>Suppose you want to fetch the repositories that belong to a user and the commits that the user has made. Think for a moment about how you’d normally do this using a REST API.</p>
<p>In REST, it will typically take multiple calls. In GraphQL, you can fetch related objects easily in the same query. The first concept we’ll look at is called a <strong>Connection</strong>. A connection allows fetching related objects in the same query. Objects are connected to other objects using <strong>edges</strong>. In other words, when you query for a connection, you’re traversing the connection’s edges to get its nodes. A <strong>node</strong> is a generic term for an object that is accessible via an edge. You can read more about Connections <a href="https://blog.apollographql.com/explaining-graphql-connections-c48b7c3d6976">here</a>. I’m borrowing a diagram and an example below to complete the definition.</p>
<p><img src="https://codeahoy.com/img/graphql/graphql_connections_edges_nodes.png" alt="GraphQL_Connections_Edges_Node" /></p>
<p>If we want to get the first three friends (also users) that a user, say <em>caesar</em>, is connected to, we can run this query:</p>
<pre class="prettyprint">
// Made up example. Will NOT work in GitHub GraphQL Explorer
{
user(id: "caesar") {
id
name
friendsConnection(first: 3) {
edges {
cursor
node {
name
}
}
}
}
}
</pre>
<p>You can try out a similar query in the GitHub GraphQL Explorer. The following query will return the last 5 public repositories for <em>alexcrichton</em> (who, btw, is an excellent developer!)</p>
<pre class="prettyprint">
{
user(login: "alexcrichton") {
bio
repositories(last: 5, privacy: PUBLIC) {
edges {
node {
name
}
}
}
}
}
</pre>
<p>The <code class="language-plaintext highlighter-rouge">last</code> parameter that we passed to the <code class="language-plaintext highlighter-rouge">repositories</code> connection in the last example allowed us to <strong>paginate</strong> results. There are several other pagination options available, such as <code class="language-plaintext highlighter-rouge">first</code>, <code class="language-plaintext highlighter-rouge">offset</code>, <code class="language-plaintext highlighter-rouge">after</code>, etc. You can read more <a href="https://graphql.org/learn/pagination/">here</a>.</p>
<h2 id="operation-names-and-variable-arguments">Operation Names and Variable Arguments</h2>
<p>In our last example, we passed the argument (username) directly into the query. What if we want to make it dynamic and control it easily, without editing the query string every time we want to look up a different user?</p>
<p>Up until now, we have been using a shorthand syntax to query GitHub’s GraphQL API by omitting the optional <code class="language-plaintext highlighter-rouge">query</code> keyword. Imagine if we had a lot of queries. It would be nice to have a way to name the queries to make them easy to find. It turns out there’s a way in GraphQL to do this: we can have multiple queries and give them different names so we can choose at runtime which query we want to run.</p>
<p><img src="https://codeahoy.com/img/graphql/graphiql_named_queries.png" alt="GraphQL_Named_Query" /></p>
<p>In the image above, we have defined two different queries and given them unique names: <code class="language-plaintext highlighter-rouge">UserDetails1</code> and <code class="language-plaintext highlighter-rouge">UserDetails2</code>, to get details for two different GitHub accounts. It works but doesn’t look right (duplication): we are running the same query twice with different hardcoded usernames. Let’s make it dynamic by passing the username as a parameter at run-time. We can do this by following these steps:</p>
<ol>
<li>Define a named query e.g. <code class="language-plaintext highlighter-rouge">GitHubUserDetails</code></li>
<li>Specify a query argument and make it required by adding an exclamation point e.g. <code class="language-plaintext highlighter-rouge">query GitHubUserDetails($username:String!)</code></li>
<li>Provide the value to the argument at run-time before running the query e.g. <code class="language-plaintext highlighter-rouge">{"username": "alexcrichton"}</code></li>
</ol>
<p>Putting all these together, we get:</p>
<pre class="prettyprint">
query GitHubUserDetails($username:String!) {
user(login: $username) {
bio
repositories(last: 5, privacy: PUBLIC) {
edges {
node {
name
}
}
}
}
}
</pre>
<p>And in the query parameters section at the bottom, set the value of the <code class="language-plaintext highlighter-rouge">username</code> parameter:</p>
<pre class="prettyprint">
{
"username": "alexcrichton"
}
</pre>
<p><img src="https://codeahoy.com/img/graphql/graphql_namedquery_variable.png" alt="GraphQL_Named_Query_Variable" /></p>
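<p>Outside of GraphiQL, a named query and its variables travel together as a single JSON payload that is POSTed to the GraphQL endpoint. Here’s a minimal JavaScript sketch of building that payload (the username is just an example, and the commented-out <code class="language-plaintext highlighter-rouge">fetch</code> call assumes you have a GitHub personal access token):</p>

```javascript
// Build the JSON payload for a named GraphQL query with variables.
const query = `
  query GitHubUserDetails($username: String!) {
    user(login: $username) {
      bio
      repositories(last: 5, privacy: PUBLIC) {
        edges { node { name } }
      }
    }
  }`;

const payload = JSON.stringify({
  query,
  variables: { username: "alexcrichton" },
});

// To actually run it, POST the payload to GitHub's GraphQL endpoint:
// fetch("https://api.github.com/graphql", {
//   method: "POST",
//   headers: { Authorization: "bearer YOUR_TOKEN" },
//   body: payload,
// }).then((r) => r.json()).then(console.log);

console.log(payload);
```

<p>Notice that the query text itself never changes; only the <code class="language-plaintext highlighter-rouge">variables</code> object does, which is exactly what makes named queries with variable arguments reusable.</p>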
<h2 id="mutations">Mutations</h2>
<p>So far, we’ve just been fetching data. Just like in REST APIs (e.g. the <a href="https://codeahoy.com/2016/07/04/rest-design-choosing-the-right-http-method/">PUT method in REST</a>), there are ways in GraphQL to update or modify data on the server. This is achieved using what’s known as <strong>mutations</strong>. You can see a list of available mutations and more details in the Document Explorer. Let’s see how to use mutations with an example. In this example, we’ll add a star to an existing repository.</p>
<p>Please note that GitHub has blocked mutations to third-party repositories, so you’ll need to follow these steps for <strong>one of your own repositories</strong>.</p>
<p>Before you can run a mutation to add a star to one of your repositories, you’ll need to get the unique id of your repository. It’s simple enough to get this information using a query like the one below. Please replace <code class="language-plaintext highlighter-rouge">owner</code> with your GitHub username (it has to be yours!) and <code class="language-plaintext highlighter-rouge">name</code> with the name of one of your repositories.</p>
<pre class="prettyprint">
query FindRepositoryId {
repository(owner: "umermansoor", name:"hadoop-java-example") {
id
}
}
</pre>
<p>You should see a response like this:</p>
<pre class="prettyprint">
{
"data": {
"repository": {
"id": "MDEwOlJlcG9zaXRvcnk3NTkxMzY1"
}
}
}
</pre>
<p>Now that we have the <code class="language-plaintext highlighter-rouge">id</code>, we can run our mutation to add a star to the repository.</p>
<pre class="prettyprint">
mutation AddStarMutation($input: AddStarInput!) {
addStar(input: $input) {
clientMutationId
}
}
</pre>
<p>Then in the Query Parameters section at the bottom, add the following code, replacing <code class="language-plaintext highlighter-rouge">starrableId</code> with the <code class="language-plaintext highlighter-rouge">id</code> you retrieved in the query above. <code class="language-plaintext highlighter-rouge">clientMutationId</code> could be anything and you can leave it as is.</p>
<pre class="prettyprint">
{
"input": {
"starrableId": "MDEwOlJlcG9zaXRvcnk3NTkxMzY1",
"clientMutationId": "12345"
}
}
</pre>
<p><img src="https://codeahoy.com/img/graphql/graphql-mutation-example.png" alt="GraphQL_Mutation" /></p>
<p>Run the mutation now. If everything was done right, you should see that you’ve starred the repository.</p>
<p>That’s all. I hope this gave you a quick overview of GraphQL and its capabilities.</p>
<p>Note: While I have compared GraphQL and REST in this article, please keep in mind that GraphQL is not a replacement of REST. There are many cases where REST makes perfect sense (e.g. for internal APIs).</p>
<p>If you’re interested in learning more about GraphQL, I would highly recommend the video course below. It’s free and shows how to build a GraphQL server using Node.js and have a React frontend query it using Apollo Client. Until next time!</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/ed8SzALpx1Q" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
== vs === in Javascript and Which Should be Used When2019-10-12T00:00:00+00:00https://codeahoy.com/javascript/2019/10/12/==-vs-===-in-javascript<p>In JavaScript, we have a couple of options for checking equality:</p>
<ul>
<li><strong>==</strong> (Double equals operator): Known as the equality or <em>abstract comparison</em> operator</li>
<li><strong>===</strong> (Triple equals operator): Known as the identity or <em>strict comparison</em> operator</li>
</ul>
<p>In this post, we’ll explore the similarities and differences between these operators.</p>
<!--more-->
<p>Let’s declare two variables <code class="language-plaintext highlighter-rouge">foo</code> and <code class="language-plaintext highlighter-rouge">bar</code> and compare them using both operators.</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">var</span> <span class="nx">foo</span> <span class="o">=</span> <span class="mi">13</span><span class="p">;</span>
<span class="kd">var</span> <span class="nx">bar</span> <span class="o">=</span> <span class="mi">13</span><span class="p">;</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nx">foo</span> <span class="o">==</span> <span class="nx">bar</span><span class="p">);</span> <span class="c1">// true</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nx">foo</span> <span class="o">===</span> <span class="nx">bar</span><span class="p">);</span> <span class="c1">// also true</span>
</code></pre></div></div>
<p>In the above example, both operators returned the same answer i.e. <code class="language-plaintext highlighter-rouge">true</code>. So what’s the difference?</p>
<h2 id="the-difference-between--and-">The Difference between <code class="language-plaintext highlighter-rouge">==</code> and <code class="language-plaintext highlighter-rouge">===</code></h2>
<p>The difference between <code class="language-plaintext highlighter-rouge">==</code> and <code class="language-plaintext highlighter-rouge">===</code> is that:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">==</code> <strong>converts</strong> the variable values to the <strong>same</strong> type before performing comparison. This is called <a href="https://developer.mozilla.org/en-US/docs/Glossary/Type_coercion">type coercion</a>.</li>
<li><code class="language-plaintext highlighter-rouge">===</code> does <strong>not</strong> do any type conversion (coercion) and returns <em>true</em> only <strong>if</strong> both values <strong>and</strong> types are identical for the two variables being compared.</li>
</ul>
<p>Let’s take a look at another example:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">var</span> <span class="nx">one</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="kd">var</span> <span class="nx">one_again</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="kd">var</span> <span class="nx">one_string</span> <span class="o">=</span> <span class="dl">"</span><span class="s2">1</span><span class="dl">"</span><span class="p">;</span> <span class="c1">// note: this is string</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nx">one</span> <span class="o">==</span> <span class="nx">one_again</span><span class="p">);</span> <span class="c1">// true</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nx">one</span> <span class="o">===</span> <span class="nx">one_again</span><span class="p">);</span> <span class="c1">// true</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nx">one</span> <span class="o">==</span> <span class="nx">one_string</span><span class="p">);</span> <span class="c1">// true. See below for explanation.</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nx">one</span> <span class="o">===</span> <span class="nx">one_string</span><span class="p">);</span> <span class="c1">// false. See below for explanation.</span>
</code></pre></div></div>
<ul>
<li>Line 7: <code class="language-plaintext highlighter-rouge">console.log(one == one_string)</code> returns <em>true</em> because both variables, <code class="language-plaintext highlighter-rouge">one</code> and <code class="language-plaintext highlighter-rouge">one_string</code> contain the same value even though they have <em>different types</em>: <code class="language-plaintext highlighter-rouge">one</code> is of type <code class="language-plaintext highlighter-rouge">Number</code> whereas <code class="language-plaintext highlighter-rouge">one_string</code> is <code class="language-plaintext highlighter-rouge">String</code>. But since the <code class="language-plaintext highlighter-rouge">==</code> operator does type coercion, the result is true.</li>
<li>Line 8: <code class="language-plaintext highlighter-rouge">console.log(one === one_string)</code> returns <em>false</em> because the types of variables are different.</li>
</ul>
<h3 id="is--faster-than--a-quick-look-at-the-performance-of-the-two-operators">Is <code class="language-plaintext highlighter-rouge">===</code> Faster than <code class="language-plaintext highlighter-rouge">==</code>? A Quick Look at the Performance of the Two Operators</h3>
<p><strong>In theory</strong>, when comparing variables of identical types, performance should be similar across both operators because they use the same algorithm. When the types are different, the triple equals operator (<code class="language-plaintext highlighter-rouge">===</code>) should perform better than double equals (<code class="language-plaintext highlighter-rouge">==</code>) because it doesn’t have to do the extra step of type coercion. But does it? Here’s a performance test you can run to see for yourself.</p>
<ul>
<li><a href="https://web.archive.org/web/20160312115113/https://jsperf.com/triple-equals-vs-twice-equals/18">jsperf Test 2</a></li>
</ul>
<p>If you look at the graph at the bottom of the test, you’ll see that performance varies across different browser implementations and that the gains are almost negligible.</p>
<p>But if you think about it, <strong>performance is totally irrelevant</strong> and shouldn’t play a role in deciding when to use one operator over the other. Either you need type coercion or you don’t. If you don’t need it, don’t use double equals operator (<code class="language-plaintext highlighter-rouge">==</code>) because you might get unexpected results. Most <a href="https://www.jslint.com/">linters</a> will complain if you use <code class="language-plaintext highlighter-rouge">==</code>. To further scare you away from <code class="language-plaintext highlighter-rouge">==</code>: it’s pretty <strong>confusing</strong> and has odd rules. For example, <code class="language-plaintext highlighter-rouge">"1" == true</code> or <code class="language-plaintext highlighter-rouge">"" == 0</code> will return <code class="language-plaintext highlighter-rouge">true</code>. For more peculiarities, take a look at the <a href="https://dorey.github.io/JavaScript-Equality-Table/">Javascript Equality Table</a>.</p>
<p>In short, always use <code class="language-plaintext highlighter-rouge">===</code> everywhere except when you need type coercion (in that case, use <code class="language-plaintext highlighter-rouge">==</code>.)</p>
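<p>To see a few of those odd coercion rules in action:</p>

```javascript
// Abstract equality (==) coerces operands before comparing,
// which produces some surprising results:
console.log("1" == true);        // true:  both operands coerce to the number 1
console.log("" == 0);            // true:  the empty string coerces to 0
console.log(null == undefined);  // true:  special-cased by the language spec
console.log("1" === true);       // false: no coercion; the types differ
```
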
<h3 id="inequality-operators--and-">Inequality Operators: <code class="language-plaintext highlighter-rouge">!=</code> and <code class="language-plaintext highlighter-rouge">!==</code></h3>
<p><code class="language-plaintext highlighter-rouge">==</code> and <code class="language-plaintext highlighter-rouge">===</code> have their counterparts when it comes to checking for inequality:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">!=</code>: Converts values if variables are different types before checking for inequality</li>
<li><code class="language-plaintext highlighter-rouge">!==</code>: Checks both type and value for the two variables being compared</li>
</ul>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">var</span> <span class="nx">one</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="kd">var</span> <span class="nx">one_again</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="kd">var</span> <span class="nx">one_string</span> <span class="o">=</span> <span class="dl">"</span><span class="s2">1</span><span class="dl">"</span><span class="p">;</span> <span class="c1">// note: this is a string</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nx">one</span> <span class="o">!=</span> <span class="nx">one_again</span><span class="p">);</span> <span class="c1">// false</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nx">one</span> <span class="o">!=</span> <span class="nx">one_string</span><span class="p">);</span> <span class="c1">// false</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nx">one</span> <span class="o">!==</span> <span class="nx">one_string</span><span class="p">);</span><span class="c1">// true. Types are different</span>
</code></pre></div></div>
<h3 id="equality-operators-and-objects-and-other-reference-types">Equality Operators and Objects (and other reference types)</h3>
<p>So far, we have been exploring the equality and inequality operators using primitive types. What about reference types like <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array">Arrays</a> or <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object">Objects</a>? If we create two arrays that have identical contents, can we compare them using equality operators the same way we do for primitives? The answer is <em>no</em>, you can’t. Let’s take a look at an example:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">var</span> <span class="nx">a1</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">]</span>
<span class="kd">var</span> <span class="nx">a2</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">]</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nx">a1</span> <span class="o">==</span> <span class="nx">a2</span><span class="p">);</span> <span class="c1">// false</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nx">a1</span> <span class="o">===</span> <span class="nx">a2</span><span class="p">);</span> <span class="c1">// false</span>
</code></pre></div></div>
<p>Here, both the <code class="language-plaintext highlighter-rouge">==</code> and <code class="language-plaintext highlighter-rouge">===</code> return the same answer: <code class="language-plaintext highlighter-rouge">false</code>. What’s happening here is that both <code class="language-plaintext highlighter-rouge">a1</code> and <code class="language-plaintext highlighter-rouge">a2</code> are pointing to different objects in memory. Even though the array contents are the same, these essentially have different values. Same applies to objects and other reference types.</p>
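<p>If you need to compare the <em>contents</em> of two arrays rather than their references, one simple approach is an element-by-element check. A sketch (shallow comparison only; nested arrays or objects would still be compared by reference):</p>

```javascript
// Shallow, element-by-element array comparison using ===.
function arraysEqual(a, b) {
  return a.length === b.length && a.every((value, i) => value === b[i]);
}

console.log(arraysEqual([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])); // true
console.log(arraysEqual([1, 2, 3], [1, 2, 4]));             // false
```
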
<h3 id="ecmascript-6-objectis">ECMAScript 6: Object.is()</h3>
<p>I said at the beginning of this article that there are a couple of options for checking equality in JavaScript. That isn’t the whole story anymore. ECMAScript 6 introduced a third method for comparing values:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">Object.is()</code></li>
</ul>
<p>The triple equals operator (<code class="language-plaintext highlighter-rouge">===</code>) is the recommended way to compare values, but it’s not perfect. Here are a couple of examples where its behavior is confusing:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="o">+</span><span class="mi">0</span> <span class="o">===</span> <span class="o">-</span><span class="mi">0</span><span class="p">);</span> <span class="c1">// true </span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="kc">NaN</span> <span class="o">===</span> <span class="kc">NaN</span><span class="p">);</span> <span class="c1">// false</span>
</code></pre></div></div>
<p>To make comparisons less confusing, <a href="https://en.wikipedia.org/wiki/ECMAScript#6th_Edition_-_ECMAScript_2015">ECMAScript 6</a> introduced a new method: <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object/is">Object.is()</a>. It takes two arguments and returns <code class="language-plaintext highlighter-rouge">true</code> if both the values and types are equal. Essentially, it’s identical to the <code class="language-plaintext highlighter-rouge">===</code> operator, but without its quirks. Let’s take a look at some examples:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nb">Object</span><span class="p">.</span><span class="nx">is</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">));</span> <span class="c1">// true</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nb">Object</span><span class="p">.</span><span class="nx">is</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="dl">"</span><span class="s2">2</span><span class="dl">"</span><span class="p">));</span> <span class="c1">// false. Different types</span>
<span class="c1">// And it fixes the quirks of ===</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nb">Object</span><span class="p">.</span><span class="nx">is</span><span class="p">(</span><span class="o">+</span><span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="mi">0</span><span class="p">));</span> <span class="c1">// false</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nb">Object</span><span class="p">.</span><span class="nx">is</span><span class="p">(</span><span class="kc">NaN</span><span class="p">,</span> <span class="kc">NaN</span><span class="p">));</span><span class="c1">// true</span>
</code></pre></div></div>
<p>Next up, read our <strong><a href="/learn/html/ch8/">Free Primer on JavaScript</a></strong>.</p>
What are -Xmx and -Xms parameters in Java/JVM (Updated up to Java 13)2019-09-02T00:00:00+00:00https://codeahoy.com/2019/09/02/java-xmx-vs-xms<p>In short,</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">Xmx</code> specifies the <strong>maximum</strong> heap size available to an application</li>
<li><code class="language-plaintext highlighter-rouge">Xms</code> specifies the <strong>minimum</strong> heap size available to an application</li>
</ul>
<p>These are Java Virtual Machine (JVM) parameters that are used to specify memory boundaries for Java applications. They are often used when troubleshooting performance issues or <a href="https://docs.oracle.com/javase/9/docs/api/java/lang/OutOfMemoryError.html">OutOfMemoryError</a>s. They <strong>control the amount of memory</strong> that is available to a Java application. The <code class="language-plaintext highlighter-rouge">Xmx</code> parameter specifies the <strong>maximum memory</strong> an app can use, whereas <code class="language-plaintext highlighter-rouge">Xms</code> specifies the <strong>minimum</strong> or the initial memory pool. If your application exceeds the maximum memory (allocated using <code class="language-plaintext highlighter-rouge">Xmx</code>) and the garbage collector cannot free up memory, the JVM will crash with an <a href="https://docs.oracle.com/javase/9/docs/api/java/lang/OutOfMemoryError.html">OutOfMemoryError</a>. If you’re interested, I wrote an <strong>article</strong> explaining with examples <a href="https://codeahoy.com/2017/08/06/basics-of-java-garbage-collection/">how garbage collection works and its generations</a>.</p>
<!--more-->
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="nv">$ </span>java <span class="nt">-Xms256m</span> <span class="nt">-Xmx1024m</span> <span class="nt">-jar</span> yourapp.jar
</code></pre></div></div>
<p>In the example above, the application <code class="language-plaintext highlighter-rouge">yourapp.jar</code> will get an initial memory pool of 256 megabytes and a maximum up to 1024 megabytes. In <code class="language-plaintext highlighter-rouge">256m</code>, the <code class="language-plaintext highlighter-rouge">m</code> stands for megabytes. You can use <code class="language-plaintext highlighter-rouge">g</code> or <code class="language-plaintext highlighter-rouge">G</code> to indicate gigabytes.</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">Xmx1g</code> or <code class="language-plaintext highlighter-rouge">Xmx1G</code>: Sets the maximum memory size to 1 gigabyte.</li>
<li><code class="language-plaintext highlighter-rouge">Xmx1024m</code> or <code class="language-plaintext highlighter-rouge">Xmx1024M</code>: Sets the maximum memory size to 1024 megabytes.</li>
<li><code class="language-plaintext highlighter-rouge">Xmx1024000k</code> or <code class="language-plaintext highlighter-rouge">Xmx1024000K</code>: Sets the maximum memory size to 1024000 kilobytes.</li>
</ul>
<p>It’s important to note that both <code class="language-plaintext highlighter-rouge">Xmx</code> and <code class="language-plaintext highlighter-rouge">Xms</code> are optional. If these are not provided, the Java Virtual Machine (JVM) will use <strong>default values</strong> for them.</p>
<h2 id="default-java-xmx-and-xms-values">Default Java Xmx and Xms Values</h2>
<p>The default values vary and depend on several factors: the amount of physical memory on the system, the JVM mode (e.g. <code class="language-plaintext highlighter-rouge">-server vs -client</code>), and the JVM implementation and version.
Typically, the default values are <strong><a href="https://docs.oracle.com/javase/9/gctuning/ergonomics.htm#JSGCT-GUID-83551BA5-ADEA-4E2E-B60A-3A953DA8FD02">calculated</a></strong> as follows:</p>
<ul>
<li>Initial heap size of 1/64 of physical memory (for <code class="language-plaintext highlighter-rouge">Xms</code>)</li>
<li>Maximum heap size of 1/4 of physical memory (for <code class="language-plaintext highlighter-rouge">Xmx</code>)</li>
</ul>
<p>An easy way to determine the default settings is to use the Print Flags option. It will show Xms (InitialHeapSize) and Xmx (MaxHeapSize) in bytes; you’ll need to convert them to MB or GB manually.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>java <span class="nt">-XX</span>:+PrintCommandLineFlags <span class="nt">-version</span>
</code></pre></div></div>
<p>On my machine (Macbook Pro with 8 GB of memory) I got the following output:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">-XX</span>:InitialHeapSize<span class="o">=</span>134217728 <span class="nt">-XX</span>:MaxHeapSize<span class="o">=</span>2147483648 <span class="nt">-XX</span>:+PrintCommandLineFlags
<span class="nt">-XX</span>:+UseCompressedClassPointers <span class="nt">-XX</span>:+UseCompressedOops <span class="nt">-XX</span>:+UseParallelGC
</code></pre></div></div>
<p>So on my machine with <strong>8 GB</strong> of total physical memory, I get:</p>
<ul>
<li>Xms (InitialHeapSize): 134217728 bytes or <strong>134 MB</strong> (~ roughly 1/64th of 8 GB)</li>
<li>Xmx (MaxHeapSize): 2147483648 bytes or <strong>2 GB</strong> (~ roughly 1/4th of 8 GB)</li>
</ul>
<p>You can specify either Xms, Xmx or both. If you don’t specify either one of them, the default value will be used. In the example below, the maximum memory will be limited to 1024 megabytes. The initial memory will use the default value.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
java <span class="nt">-Xmx1024m</span> <span class="nt">-jar</span> yourapp.jar
</code></pre></div></div>
<p>Here’s a <strong>good YouTube video</strong> that walks through the process of troubleshooting memory-related errors and shows how to fix them using examples.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/kQpkjCUQvEc" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<h4 id="java-13-and-the-z-garbage-collector">Java 13 and the Z Garbage Collector</h4>
<p>Java 13 introduced a new garbage collector called ZGC. One of its features is an optimization that returns unused memory to the operating system. This feature is enabled by default, and it will not return memory in a way that shrinks the heap below <code class="language-plaintext highlighter-rouge">Xms</code>. So if you’re setting <code class="language-plaintext highlighter-rouge">Xms</code> equal to <code class="language-plaintext highlighter-rouge">Xmx</code> (as many developers do), it will essentially disable the feature.</p>
<p>If you want to see all available JVM parameters, you can use the <code class="language-plaintext highlighter-rouge">java -X</code> switch e.g.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>java <span class="nt">-X</span>
<span class="nt">-Xmixed</span> mixed mode execution <span class="o">(</span>default<span class="o">)</span>
<span class="nt">-Xint</span> interpreted mode execution only
<span class="nt">-Xbootclasspath</span>:<directories and zip/jar files separated by :>
<span class="nb">set </span>search path <span class="k">for </span>bootstrap classes and resources
<span class="nt">-Xbootclasspath</span>/a:<directories and zip/jar files separated by :>
append to end of bootstrap class path
<span class="nt">-Xbootclasspath</span>/p:<directories and zip/jar files separated by :>
prepend <span class="k">in </span>front of bootstrap class path
<span class="nt">-Xdiag</span> show additional diagnostic messages
<span class="nt">-Xnoclassgc</span> disable class garbage collection
<span class="nt">-Xincgc</span> <span class="nb">enable </span>incremental garbage collection
<span class="nt">-Xloggc</span>:<file> log GC status to a file with <span class="nb">time </span>stamps
<span class="nt">-Xbatch</span> disable background compilation
<span class="nt">-Xms</span><size> <span class="nb">set </span>initial Java heap size
<span class="nt">-Xmx</span><size> <span class="nb">set </span>maximum Java heap size
<span class="nt">-Xss</span><size> <span class="nb">set </span>java thread stack size
<span class="nt">-Xprof</span> output cpu profiling data
<span class="nt">-Xfuture</span> <span class="nb">enable </span>strictest checks, anticipating future default
<span class="nt">-Xrs</span> reduce use of OS signals by Java/VM <span class="o">(</span>see documentation<span class="o">)</span>
<span class="nt">-Xcheck</span>:jni perform additional checks <span class="k">for </span>JNI functions
<span class="nt">-Xshare</span>:off <span class="k">do </span>not attempt to use shared class data
<span class="nt">-Xshare</span>:auto use shared class data <span class="k">if </span>possible <span class="o">(</span>default<span class="o">)</span>
<span class="nt">-Xshare</span>:on require using shared class data, otherwise fail.
<span class="nt">-XshowSettings</span> show all settings and <span class="k">continue</span>
<span class="nt">-XshowSettings</span>:all
show all settings and <span class="k">continue</span>
<span class="nt">-XshowSettings</span>:vm show all vm related settings and <span class="k">continue</span>
<span class="nt">-XshowSettings</span>:properties
show all property settings and <span class="k">continue</span>
<span class="nt">-XshowSettings</span>:locale
show all locale related settings and <span class="k">continue
</span>The <span class="nt">-X</span> options are non-standard and subject to change without notice.
</code></pre></div></div>
Spring Boot - Replace Tomcat With Jetty As the Embedded Server2019-09-01T00:00:00+00:00https://codeahoy.com/java/springboot/tutorial/2019/09/01/spring-boot-replace-tomcat-with-jetty-as-the-embedded-server<p><a href="http://tomcat.apache.org/">Apache Tomcat</a> and <a href="https://www.eclipse.org/jetty/">Eclipse Jetty</a> are two of the most popular web servers and Java Servlet containers. Tomcat is more widely used than Jetty and has significantly more market share. On the other hand, Jetty is lightweight, more compact, and has a smaller CPU and memory footprint. For this reason, it is easier to work with in development than Tomcat. This is not to suggest that Jetty isn’t good for production; in my experience, it is as performant as Tomcat, if not more so.</p>
<p>Many developers prefer Jetty over Tomcat during the development stage when they want to rapidly launch and test web apps on their local machines. <strong>Spring Boot <a href="https://mvnrepository.com/artifact/org.springframework.boot/spring-boot-starter-web">web starter</a></strong> uses Tomcat as the <em>default</em> embedded server. Let’s take a look at how to change it to Jetty.</p>
<p><!--more--></p>
<p>If you’d like to change the embedded web server to Jetty in a new Spring Boot web starter project, you’ll have to:</p>
<ol>
<li>Exclude Tomcat from web starter dependency, since it is added by default</li>
<li>Add the Jetty dependency</li>
</ol>
<h3 id="step-1-exclude-tomcat">Step 1: Exclude Tomcat</h3>
<p><strong>Find</strong> the following dependency in <code class="language-plaintext highlighter-rouge">pom.xml</code>:</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><dependency></span>
<span class="nt"><groupId></span>org.springframework.boot<span class="nt"></groupId></span>
<span class="nt"><artifactId></span>spring-boot-starter-web<span class="nt"></artifactId></span>
<span class="nt"></dependency></span>
</code></pre></div></div>
<p><strong>Replace</strong> it with:</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><dependency></span>
<span class="nt"><groupId></span>org.springframework.boot<span class="nt"></groupId></span>
<span class="nt"><artifactId></span>spring-boot-starter-web<span class="nt"></artifactId></span>
<span class="nt"><exclusions></span>
<span class="nt"><exclusion></span>
<span class="nt"><groupId></span>org.springframework.boot<span class="nt"></groupId></span>
<span class="nt"><artifactId></span>spring-boot-starter-tomcat<span class="nt"></artifactId></span>
<span class="nt"></exclusion></span>
<span class="nt"></exclusions></span>
<span class="nt"></dependency></span>
</code></pre></div></div>
<h3 id="step-2-add-jetty">Step 2: Add Jetty</h3>
<p>Add the following dependency to your <code class="language-plaintext highlighter-rouge">pom.xml</code>:</p>
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><dependency></span>
<span class="nt"><groupId></span>org.springframework.boot<span class="nt"></groupId></span>
<span class="nt"><artifactId></span>spring-boot-starter-jetty<span class="nt"></artifactId></span>
<span class="nt"></dependency></span>
</code></pre></div></div>
<p>Please note that some other starters, e.g. the <a href="https://www.thymeleaf.org/">Thymeleaf templating engine</a>, might pull in Tomcat by default. If you’re using one of these, you’ll have to manually <em>exclude</em> Tomcat from all such dependencies. When I run into issues, I look at the Maven dependencies to check whether Tomcat is being pulled in. You can do this either by inspecting the dependencies on the command line, e.g. <code class="language-plaintext highlighter-rouge">mvn dependency:tree -Dverbose</code>, or by using the dependency inspector in your favourite IDE, e.g. IntelliJ or Eclipse.</p>
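<p>If you’d rather script this check, here is a toy sketch in Python that scans saved <code class="language-plaintext highlighter-rouge">mvn dependency:tree</code> output for Tomcat artifacts. The sample text below is made up for illustration; real output will differ:</p>

```python
# Toy sketch: flag Tomcat artifacts in saved `mvn dependency:tree` output.
# The sample text below is made up; real output will differ.
tree_output = """\
[INFO] com.example:demo:jar:1.0.0
[INFO] +- org.springframework.boot:spring-boot-starter-web:jar:2.2.4.RELEASE
[INFO] |  \\- org.springframework.boot:spring-boot-starter-tomcat:jar:2.2.4.RELEASE
"""

# Any hit means Tomcat is still being pulled in transitively.
tomcat_lines = [line for line in tree_output.splitlines()
                if "tomcat" in line.lower()]
for line in tomcat_lines:
    print(line)
```

<p>In practice, piping the real command through <code class="language-plaintext highlighter-rouge">grep -i tomcat</code> accomplishes the same thing.</p>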
<h3 id="github-project">GitHub Project</h3>
<p>I created a <a href="https://github.com/codeahoy/BootWebJetty" rel="nofollow">SpringBoot Jetty Starter project</a> on GitHub which excludes Tomcat and uses Jetty as the web server. Here’s the complete <a href="https://github.com/codeahoy/BootWebJetty/blob/master/pom.xml" rel="nofollow">pom file</a> which shows how this is done.</p>
<p>To run the project, you’ll need to clone it first.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/codeahoy/BootWebJetty.git
</code></pre></div></div>
<p>To run it,</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mvn clean package spring-boot:run
</code></pre></div></div>
<p>In the output, you should see Jetty running on port 8080.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Jetty started on port<span class="o">(</span>s<span class="o">)</span> 8080 <span class="o">(</span>http/1.1<span class="o">)</span> with context path <span class="s1">'/'</span>
</code></pre></div></div>
<p>I created a sample controller called greeting, so you should be able to open http://127.0.0.1:8080/greeting and see a message printed to the browser window.</p>
<p>The project uses Spring Boot 2.2.4 and Java 8. You can change these in the pom file if you want to use different versions.</p>
<p>That’s all. If you found this post useful, please share it using the sharing buttons below. It will help us grow. Thank you.</p>
How Docker Works? Under the Hood Look at How Containers Work on Linux2019-04-12T00:00:00+00:00https://codeahoy.com/2019/04/12/what-are-containers-a-simple-guide-to-containerization-and-how-docker-works<p>Docker is awesome. It enables software developers to package, ship and run their applications <em>anywhere</em> without having to worry about setup or dependencies. Combined with Kubernetes, it becomes even more powerful for streamlining cluster deployments and management. I digress. Back to Docker. Docker is loved by software developers and its adoption rate has been remarkable.</p>
<p>In this post, we’ll look at how Docker works under the hood. Docker uses a technology called “Containerization” to do its magic and that’s what we are going to explore next.</p>
<h2 id="why-do-we-need-containers">Why Do We Need Containers?</h2>
<p>Let’s say you want to run a software <em>Foo</em> on your computer. <em>Foo</em> requires Node.js version <em>10</em> (assume <em>Foo</em> is incompatible with newer Node.js versions), so you install Node <em>10</em> on your machine. Later, you want to run another software, <em>Bar</em>, which requires Node.js version <em>15</em>.</p>
<p>This has created a problem. (Assume we can’t use <code class="language-plaintext highlighter-rouge">nvm</code> to switch between Node versions easily.)</p>
<p>One way we could solve this problem is by using Virtual Machines or VMs to create isolated environments for running <em>Foo</em> and <em>Bar</em>. You can create <em>one</em> VM with <em>Foo</em> and Node 10 and <em>another</em> with <em>Bar</em> and Node 15. Voila, we are back in business.</p>
<p>However, there’s an issue with this approach: it’s very <strong>inefficient</strong>. Each VM requires its own <em>operating system</em>. We are now running two separate guest operating systems on our computer just to run two different processes.</p>
<p>What if there was a way to run <em>Foo</em> and <em>Bar</em> on your machine <strong>without</strong> running two extra operating systems?</p>
<p>Let’s review the <em>interaction</em> between processes and operating system. Whenever a process wants to do anything, it asks the operating system. Processes like <em>Foo</em> and <em>Bar</em> would ask the OS questions like, “which Node version do you have installed?” or “how much memory is available to me?” or “what other processes are running?”</p>
<p>What if we could <strong>intercept</strong> and <strong>control</strong> the communication between processes and the operating system, and send customized responses to processes to control their behavior? For example, when <em>Foo</em> and <em>Bar</em> ask “what Node version do you have installed”, we tell Foo that we have Node v10 and send it the executable’s location. When Bar asks the same question, we respond with Node v15. In other words, what if we create a <em>virtual OS</em> for <em>Foo</em> and <em>Bar</em>?</p>
<p>This is exactly what containers do. They allow us to run different processes, <em>Foo</em> and <em>Bar</em>, on the same OS. A <strong>container runtime</strong> intercepts these questions from processes and gives a customized response. It will tell <em>Foo</em> that Node 10 is installed, and <em>Bar</em> will receive a response that the machine only has Node 15.</p>
<figure class="figure d-block">
<p><img src="/img/dockercontainers/containers.png" class="rounded mx-auto d-block" alt="Containers provide a virtualized runtime to run processes in isolated environments." /></p>
</figure>
<p>By strategically modifying responses to <em>Foo</em> and <em>Bar</em>, we have essentially isolated their environments without running separate operating systems. This strategy is called “containerization”.</p>
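<p>To make the idea concrete, here is a toy sketch in Python. This is purely illustrative (Docker is not implemented this way, and all the names are invented): the “runtime” intercepts the same question and answers it differently depending on which container is asking:</p>

```python
class ContainerRuntime:
    """Toy model of a container runtime that intercepts questions from
    processes and answers per-container. Purely illustrative."""
    def __init__(self):
        self.containers = {}  # container name -> facts about its virtual OS

    def add_container(self, name, node_version):
        self.containers[name] = {"node_version": node_version}

    def ask(self, container, question):
        # Answer from the asking container's own virtualized view of the OS.
        return self.containers[container][question]

runtime = ContainerRuntime()
runtime.add_container("foo", "10")
runtime.add_container("bar", "15")
print(runtime.ask("foo", "node_version"))  # -> 10
print(runtime.ask("bar", "node_version"))  # -> 15
```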
<p>Docker is a set of tools <em>and</em> a widely popular <em>container runtime</em>. It’s a <strong>complete platform</strong> for building, testing, deploying and publishing <strong>containerized</strong> applications. I say platform because Docker is a <strong>set of tools</strong> for managing all-things related to containers.</p>
<p>Now that we understand the use case for containers and what they are, let’s take a more in-depth look at how they actually work to achieve isolation.</p>
<h2 id="what-are-containers">What are Containers?</h2>
<p>Containers provide a way to install and run your applications in <strong>isolated environments</strong> on a machine. Applications running inside a container are limited to the resources (CPU, memory, disk, process space, users, networking, volumes) allocated for that container. Their <strong>visibility</strong> is limited to the container’s resources, so they don’t conflict with other containers. You can think of containers as isolated sandboxes on a single machine for applications to run in.</p>
<p>As we have discussed, this concept is very similar to <strong>virtual machines</strong>. The key differentiator is that containers use a <em>light-weight</em> technique to achieve resource isolation. The technique used by containers exploits features of the underlying <strong>Linux kernel</strong>, as opposed to the <strong><a href="https://en.wikipedia.org/wiki/Hypervisor">hypervisor</a></strong>-based approach taken by virtual machines. In other words, containers call Linux commands to allocate and isolate a set of resources and then run your application in this space. Let’s take a quick look at two such features:</p>
<h3 id="1-namespaces">1. namespaces</h3>
<p>I’m oversimplifying, but <a href="http://man7.org/linux/man-pages/man7/namespaces.7.html">Linux namespaces</a> basically allow users to isolate resources like CPU between independent processes. A process’s access and visibility are limited to its namespace. So users can run processes in one namespace without ever having to worry about conflicting with processes running inside another namespace. Processes can even have the same PID on the same machine within different containers. Likewise, applications in two different containers can use the same ports (e.g. port 80).</p>
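<p>As a toy model (not real kernel code; the class and names are invented for illustration), PID isolation means each namespace hands out its own PIDs, so two processes in different namespaces can both be PID 1:</p>

```python
import itertools

class PidNamespace:
    """Toy model: each namespace hands out its own PIDs, starting at 1."""
    def __init__(self):
        self._next_pid = itertools.count(1)
        self.procs = {}  # pid -> process name, visible only inside this namespace

    def spawn(self, name):
        pid = next(self._next_pid)
        self.procs[pid] = name
        return pid

ns_a, ns_b = PidNamespace(), PidNamespace()
pid_foo = ns_a.spawn("foo")
pid_bar = ns_b.spawn("bar")
print(pid_foo, pid_bar)  # -> 1 1 (same PID, no conflict across namespaces)
```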
<h3 id="2-cgroups">2. cgroups</h3>
<p><a href="http://man7.org/linux/man-pages/man7/cgroups.7.html">cgroups</a> allow putting <strong>limits</strong> and constraints on available resources. For example, you can create a cgroup and limit the memory available to the processes inside it to 1 GB on a machine that has, say, 16 GB of memory.</p>
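<p>Conceptually, a cgroup acts like a budget that allocations are checked against. The sketch below is a made-up Python model, not the real interface (the kernel exposes cgroups through the cgroup filesystem, typically mounted under <code class="language-plaintext highlighter-rouge">/sys/fs/cgroup</code>):</p>

```python
class Cgroup:
    """Toy model of a cgroup memory limit. The real mechanism is a kernel
    feature exposed via the cgroup filesystem, not a Python class."""
    def __init__(self, limit_bytes):
        self.limit = limit_bytes
        self.used = 0

    def allocate(self, nbytes):
        if self.used + nbytes > self.limit:
            raise MemoryError("cgroup memory limit exceeded")
        self.used += nbytes

cg = Cgroup(limit_bytes=1 << 30)      # 1 GB limit on, say, a 16 GB machine
cg.allocate(800 * 1024 * 1024)        # 800 MB: fine
try:
    cg.allocate(400 * 1024 * 1024)    # would push usage past 1 GB
except MemoryError as e:
    print(e)                          # -> cgroup memory limit exceeded
```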
<p>By now, you’ve probably guessed how Docker works. Behind the scenes, when you ask Docker to run a container, it <strong>sets up a resource isolated environment</strong> on your machine. Then it <strong>copies over your packaged application</strong> and associated files to the filesystem inside the namespace. At this point, the environment setup is complete. Docker then executes the command that you specified and hands over the control.</p>
<p>In short, Docker <strong>orchestrates</strong> by setting up containers using Linux’s namespace and cgroups (and a few other) commands, copying your application files to the disk allocated for the container and then running the startup command. It also comes with a bunch of other tools for managing containers, like the ability to list running containers, stop containers, publish container images, and many others.</p>
<figure class="figure d-block">
<p><img src="/img/dockercontainers/containers-on-box.png" class="rounded mx-auto d-block" alt="Containers running on linux kernel to provide isolated environments to processes" /></p>
</figure>
<p>Compared to virtual machines, containers are <strong>lightweight and faster</strong> because they make use of the underlying Linux OS to run natively in <strong>loosely</strong> isolated environments. A virtual machine hypervisor creates a very strong boundary to prevent applications from breaking out of it, whereas <a href="https://sysdig.com/blog/container-isolation-gone-wrong/">containers’ boundaries are not as strong</a>. Another difference is that since the <em>namespace</em> and <em>cgroups</em> features are only available on Linux, <strong>containers cannot run natively on other operating systems</strong>. At this point you might be wondering how Docker runs on macOS or Windows. Docker actually uses a little trick and installs a <strong>Linux virtual machine</strong> on non-Linux operating systems. It then runs containers inside the virtual machine.</p>
<p>Let’s put everything that we have learned so far together and create and run a Docker container from scratch. If you don’t already have Docker installed on your machine, head over <a href="https://docs.docker.com/install/">here</a> to install it. In our made-up example, we’ll create a Docker container, download a web server written in C, compile it, run it and then connect to the web server from our web browser (in other words, from the host machine that’s running the container.)</p>
<p>We’ll start where all Docker projects start: by creating a file called <code class="language-plaintext highlighter-rouge">Dockerfile</code>. This file contains instructions that tell Docker how to create a <strong>docker image</strong> that’s used for creating and running containers. Since we haven’t discussed images yet, let’s take a look at the <a href="https://docs.docker.com/get-started/#images-and-containers">official definition</a>:</p>
<blockquote>
<p>An image is an <strong>executable package that includes everything needed to run an application–the code, a runtime, libraries, environment variables, and configuration files</strong>. A container is a runtime instance of an image</p>
</blockquote>
<p>Put simply, when you ask Docker to run a container, you must give it an <strong>image</strong> which contains:</p>
<ol>
<li><strong>File system snapshot</strong> containing your application and all of its dependencies.</li>
<li><strong>A startup command</strong> to run when the container is launched.</li>
</ol>
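<p>In toy form (illustrative only; real images are layered and far more involved), an image is just those two pieces, and each container gets its own private copy of the filesystem snapshot:</p>

```python
import copy

class Image:
    """Toy model: an image = a filesystem snapshot + a startup command."""
    def __init__(self, fs_snapshot, startup_cmd):
        self.fs_snapshot = fs_snapshot
        self.startup_cmd = startup_cmd

class Container:
    """A container is a runtime instance of an image: it gets its own
    copy of the filesystem snapshot, then runs the startup command."""
    def __init__(self, image):
        self.fs = copy.deepcopy(image.fs_snapshot)
        self.cmd = image.startup_cmd

img = Image(
    {"/home/tiny": "<compiled web server>", "/home/index.html": "Hello World"},
    ["./tiny", "8082"],
)
c1, c2 = Container(img), Container(img)
c1.fs["/home/index.html"] = "changed"
print(c2.fs["/home/index.html"])  # -> Hello World (writes aren't shared)
```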
<p>Back to creating our <code class="language-plaintext highlighter-rouge">Dockerfile</code> so we can build an image. It’s extremely common in the Docker world to create images based on other images. For example, the official Redis Docker image is based on a Debian file system snapshot (rootfs tarball) and installs and configures Redis on top of it.</p>
<p>In our example, we’ll base our image on <a href="https://hub.docker.com/_/alpine">Alpine Linux</a>. When you see the term <em>alpine</em> in Docker, it usually means a stripped-down, bare-essentials image. The Alpine Linux image is only about 5 MB in size!</p>
<p>Alright. Create a new folder (e.g. <code class="language-plaintext highlighter-rouge">dockerprj</code>) on your computer and then create a file called <code class="language-plaintext highlighter-rouge">Dockerfile</code>.</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>umermansoor:dockerprj<span class="nv">$ </span><span class="nb">touch </span>Dockerfile
</code></pre></div></div>
<p>Paste the following in the <code class="language-plaintext highlighter-rouge">Dockerfile</code>.</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Use Alpine Linux rootfs tarball to base our image on</span>
FROM alpine:3.9
<span class="c"># Set the working directory to be '/home'</span>
WORKDIR <span class="s1">'/home'</span>
<span class="c"># Setup our application on container's file system</span>
RUN wget http://www.cs.cmu.edu/afs/cs/academic/class/15213-s00/www/class28/tiny.c <span class="se">\</span>
<span class="o">&&</span> apk add build-base <span class="se">\</span>
<span class="o">&&</span> gcc tiny.c <span class="nt">-o</span> tiny <span class="se">\</span>
<span class="o">&&</span> <span class="nb">echo</span> <span class="s1">'Hello World'</span> <span class="o">>></span> index.html
<span class="c"># Start the web server. This is container's entry point</span>
CMD <span class="o">[</span><span class="s2">"./tiny"</span>, <span class="s2">"8082"</span><span class="o">]</span>
<span class="c"># Expose port 8082</span>
EXPOSE 8082
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">Dockerfile</code> above contains instructions for Docker to create an <em>image</em>. Essentially, we base our image on Alpine Linux (<a href="https://web.archive.org/web/20190922045729/http://www.ethernetresearch.com/geekzone/building-linux-rootfs-from-scratch/">rootfs tarball</a>) and set our working directory to <code class="language-plaintext highlighter-rouge">/home</code>. Next, we download, compile and create an executable of a simple web server written in C. After that, we specify the command to be executed when the container is run and expose the container’s port 8082 to the host machine.</p>
<p>Now, let’s create the image. Running <code class="language-plaintext highlighter-rouge">docker build</code> in the <strong>same directory</strong> where you created <code class="language-plaintext highlighter-rouge">Dockerfile</code> should do the trick.</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>umermansoor:dockerprj<span class="nv">$ </span>docker build <span class="nt">-t</span> codeahoydocker <span class="nb">.</span>
</code></pre></div></div>
<p>If the command is successful, you should see something similar:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Successfully tagged codeahoydocker:latest
</code></pre></div></div>
<p>At this point, our image is created. It essentially contains:</p>
<ol>
<li>Filesystem snapshot (Alpine Linux and the web server we installed)</li>
<li>Startup command (<code class="language-plaintext highlighter-rouge">./tiny 8082</code>)</li>
</ol>
<p><img src="https://codeahoy.com/img/dockercontainers/image.png" alt="image" /></p>
<p>Now that we’ve created the image, we can build and run a container from this image. To do so, run the following command:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>umermansoor:dockerprj<span class="nv">$ </span>docker run <span class="nt">-p</span> 8082:8082 codeahoydocker:latest
</code></pre></div></div>
<p>Let’s understand what’s going on here.</p>
<p>With <code class="language-plaintext highlighter-rouge">docker run</code>, we asked Docker to create and start a container from the <code class="language-plaintext highlighter-rouge">codeahoydocker:latest</code> image. <code class="language-plaintext highlighter-rouge">-p 8082:8082</code> maps port 8082 of our local machine to port 8082 inside the container. (Remember, our web server inside the container is listening for connections on port 8082.) You won’t see any output after this command, which is fine. Switch to your web browser and navigate to <a href="http://localhost:8082/index.html">localhost:8082/index.html</a>. You should see the <em>Hello World</em> message. (Instructions on how to delete the image and container to clean up will be in comments.)</p>
<p><img src="https://codeahoy.com/img/dockercontainers/tiny-container.png" alt="tiny-container" /></p>
<p>In the end, I’d like to add that while Docker is awesome and a good choice for most projects, <strong>I don’t use it everywhere</strong>. In our case, Docker combined with Kubernetes makes it really easy to deploy and manage backend <strong>microservices</strong>. We don’t have to worry about provisioning a new environment for each service, configurations, etc. On the other hand, for performance-intensive applications, Docker may not be the best choice. One of the projects I worked on had to handle long-living TCP connections from mobile game clients (1000s per machine.) <strong>Docker networking</strong> presented a lot of issues; I just couldn’t get the performance I needed out of it, so I didn’t use it for that project.</p>
<p>Hope this was helpful. Until next time.</p>
YAGNI, Cargo Cult and Overengineering - the Planes Won't Land Just Because You Built a Runway in Your Backyard2017-08-19T00:00:00+00:00https://codeahoy.com/2017/08/19/yagni-cargo-cult-and-overengineering-the-planes-wont-land-just-because-you-built-a-runway-in-your-backyard<p>It was April. The year was probably 2010. The cold, snowy winter was finally coming to an end and spring was almost in the air. I was preparing for my <strong>final exams</strong>. Review lectures were underway for the <strong>RDBMS course</strong> I was enrolled in at my university.</p>
<p><img src="https://codeahoy.com/img/blogs/uofc_winter_campus-0006.jpg" alt="uofc" /></p>
<!--more-->
<p>Around the same time, I had started hearing and reading about the shiny, new technology that was going to change the way we use databases. The <em>NoSQL</em> movement was gaining momentum. I was reading blogs about how MongoDB is big time outperforming ancient, <a href="https://www.youtube.com/watch?v=b2F-DItXtZs" rel="nofollow">non web scale</a> relational databases.</p>
<p>After the lecture, I asked my professor:</p>
<blockquote>
<p><em>Me:</em> So, between RDBMS and <em>NoSQL</em> databases, which one do you think is the best?</p>
<p><em>Professor:</em> Well, it depends.</p>
<p><em>Me:</em> Depends on what?</p>
<p><em>Professor:</em> Depends on what you are trying to achieve. Both have their pros and cons. <strong>You pick the right tool for the job</strong>.</p>
<p><em>Me:</em> But MySQL can’t really scale.</p>
<p><em>Professor:</em> How do you think we got this far? Send me an email and I’ll send you some papers and practical uses in the industry.</p>
</blockquote>
<p>SQL was hard for my brain, especially the joins. I loved NoSQL. Simple <em>key->value</em> model without any joins! RDBMS systems designed in the 1970s were simply not enough to keep up with modern demands. I had lost all interest in RDBMS and predicted they’d just fade away in the next few years.</p>
<hr />
<p>It’s 2012. <a href="https://codeahoy.com/2016/04/21/when-to-rewrite-from-scratch-autopsy-of-a-failed-software/">We’re <em>redesigning</em> my employer’s flagship product</a>. The first version was a <strong>monolith</strong> that used the boring <strong>MySQL</strong>. Spending too much time reading blogs and <em>Hacker News</em> comments section, we convinced ourselves that we need to go big and modern:</p>
<ul>
<li>Break monolith into service-oriented architecture, aka, the SOA.</li>
<li>Replace MySQL with Cassandra (MySQL to Redis to Cassandra)</li>
</ul>
<p>And we built it.</p>
<p>There was nothing wrong with the new system… except <strong>one major flaw</strong>. It was too complex for a small startup team to maintain. We had built a <strong>Formula One</strong> race car, that makes frequent pit-stops and requires very specialized maintenance, when we needed a <strong>Toyota Corolla</strong> that goes on for years and years on just the oil change.</p>
<hr />
<p>Fast forward to 2017. It feels like almost all software developers I interview these days have hands-on experience with the microservices architecture, and many have actually used it in production.</p>
<p>A grey San Francisco afternoon. I’m conducting an on-site interview. The candidate has a master’s degree and 3 years of experience at a startup that, by the looks of it, didn’t make it. I asked him to tell me about the system he built.</p>
<blockquote>
<p><em>Guy:</em> We built the system using the <em>microservices</em> architecture. We had lots of small services which made our life really easy…</p>
<p><em>Me:</em> Tell me more about it</p>
<p><em>Guy:</em> Data was written to the BI system through a <em>Kafka</em> cluster. <em>Hadoop</em> and <em>MapReduce</em> system was built to process data for analytics. The system was super scalable.</p>
</blockquote>
<p>I pressed him to tell me drawbacks of microservices architecture and problems it introduces. The guy tried to hand-wave his way through and was convinced, just like I was in 2010 about NoSQL databases, that there are absolutely no issues with microservices architecture.</p>
<p>I’m not saying the microservices architecture is bad. It has its pros and cons. It makes sense for organizations with systems complex enough to justify the operational burden, <strong>the overhead</strong>, it introduces. Martin Fowler, who helped popularize the term microservices, warns us of the “<a href="https://martinfowler.com/bliki/MicroservicePremium.html" rel="nofollow">microservices premium</a>”:</p>
<blockquote>
<p>The fulcrum of whether or not to use microservices is the complexity of the system you’re contemplating. <strong>The microservices approach is all about handling a complex system, but in order to do so the approach introduces its own set of complexities</strong>. When you use microservices you have to work on automated deployment, monitoring, dealing with failure, eventual consistency, and other factors that a distributed system introduces. There are well-known ways to cope with all this, but it’s extra effort, and nobody I know in software development seems to have acres of free time.</p>
<p>So my primary guideline would be <strong>don’t even consider microservices unless you have a system that’s too complex to manage as a monolith</strong>. The majority of software systems should be built as a single monolithic application. Do pay attention to good modularity within that monolith, but don’t try to separate it into separate services.</p>
</blockquote>
<p>Netflix is a great example. Their system grew so large and so complex that switching to microservices was justified.</p>
<p>In almost all cases, you can’t go wrong by building a monolith first. You break your monolith into a service-oriented or microservices architecture only when the benefits outweigh the complexity.</p>
<p>The guy also mentioned <strong>Kafka</strong>, the system that handles <a href="https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines" rel="nofollow">2 million writes a second</a> at LinkedIn:</p>
<blockquote>
<p><em>Me:</em> How much data do you stream to Kafka roughly?</p>
<p><em>Guy:</em> We <em>could</em> stream gigabytes of logs…</p>
<p><em>Me:</em> How much data are you streaming <em>right now</em>?</p>
<p><em>Guy:</em> Not a whole lot right now because we only have 3 customers. But the system could scale up to support millions and millions of users. Also, with both Kafka and Hadoop clusters, we get fool-proof fault-tolerance</p>
<p><em>Me:</em> How big is the team and company?</p>
<p><em>Guy:</em> Overall, I guess there were (less than 10) people in the company. Engineering was about 5.</p>
</blockquote>
<p>At this point, I was tempted to ask if he had ever heard of <em><a href="https://martinfowler.com/bliki/Yagni.html" rel="nofollow">YAGNI</a></em>:</p>
<blockquote>
<p>Yagni … stands for “You Aren’t Gonna Need It”. It is a mantra from Extreme Programming … It’s a statement that some capability we <strong>presume our software needs in the future should not be built now because “you aren’t gonna need it”</strong>.</p>
</blockquote>
<p>People sometimes have honest intentions and don’t introduce new tools, libraries, or frameworks just for the sake of <em>enhancing their resumes</em>. Sometimes they simply anticipate enormous growth and try to do everything up front so they won’t have to do the work later.</p>
<blockquote>
<p>The common reason why people build presumptive features is because they think it will be cheaper to build it now rather than build it later. But that cost comparison has to be made at least against the <strong>cost of delay</strong>, preferably factoring in the probability that you’re building an unnecessary feature, for which your odds are at least ⅔.</p>
</blockquote>
<hr />
<p>People know complexity is bad. No one likes to see bugs filed on JIRA or get PagerDuty alerts at 3 a.m. that something is wrong with the Cassandra cluster. But why do software developers still do it? Why do they choose to build complex systems without proper investigation?</p>
<p>Are there other reasons for building complex features besides preparing for a hypothetical future?</p>
<p>I’m sure the majority of you are familiar with the term <strong>cargo cult software development</strong>: teams who slavishly and blindly follow the techniques of large companies they idolize, like Google, Apple or Amazon, in the hope that they’ll achieve similar success by emulating their idols.</p>
<p>Just like the South Sea natives who built primitive runways, prayed and performed rituals in the hopes that planes would come in and <strong>bring cargo</strong>, the food and supplies. <strong>Richard Feynman</strong> warned graduates at the California Institute of Technology to not fall victim to the cargo cult thinking:</p>
<blockquote>
<p>In the South Seas there is a cargo cult of people. During the war they saw airplanes land with lots of good materials, and they want the same thing to happen now. So they’ve arranged to make things like runways, to put fires along the sides of the runways, to make a wooden hut for a man to sit in, with two wooden pieces on his head like headphones and bars of bamboo sticking out like antennas — he’s the controller — and they wait for the airplanes to land. They’re doing everything right. The form is perfect. It looks exactly the way it looked before. But it doesn’t work. No airplanes landed.</p>
</blockquote>
<p><img src="https://codeahoy.com/img/blogs/cargo-cult.jpg" alt="cargo-cult" /></p>
<p>It reminded me of the time when we took a perfectly good monolith and created service-oriented architecture (SOA). <strong>It didn’t work. The planes didn’t land</strong>.</p>
<p>Whatever the reasons:</p>
<ul>
<li><em>not wanting to be left out</em>: the new Javascript framework from this week will be the next hottest thing.</li>
<li><em>enhancing resume with buzzwords</em>: Monolith won’t impress recruiters / interviewers.</li>
<li><em>imitating heroes</em>: Facebook does it, Google does it, Twitter does it.</li>
<li>technology that could keep up with the projected 5000% YoY company growth: YAGNI.</li>
<li>latest tools to convince people that you’re a <em>proper</em> SF bay area tech company.</li>
</ul>
<p>Cargo-cult engineering just doesn’t work. You are not Google. What works for them will most likely not work for your much, much smaller company. Google actually needed MapReduce because they wanted to regenerate indexes for the entire World Wide Web, or something like that. They needed fault tolerance from thousands of commodity servers. They had <strong><a href="https://www.usenix.org/legacy/event/osdi08/tech/full_papers/zaharia/zaharia_html/index.html">20 petabytes of data</a></strong> to process.</p>
<p>20 petabytes is just enormous. In terms of the number of disk drives, here’s what half of that, 10 petabytes, would <a href="https://www.backblaze.com/blog/10-petabytes-visualized/" rel="nofollow">look</a> like:</p>
<p><img src="https://codeahoy.com/img/blogs/10petabytes.jpg" alt="10petabytes" /></p>
<p>To avoid falling in the cargo-cult trap, I have learned to do the following:</p>
<ul>
<li>Focus on the problem first, not the solution. Don’t pick any tool until you have fully understood what you are trying to achieve or solve. Don’t give up solving the actual problem and make it all about learning and using the shiny new tech.</li>
<li>Keep it simple. It’s an over-used term, but software developers <em>still</em> just don’t get it. Keep. It. Simple.</li>
<li>If you are leaning towards something that Twitter or Google uses, do your homework and understand the real reasons why they picked that technology.</li>
<li>When thinking of growth, understand that the chances of your startup growing to be the size of Facebook are slim to none. Even if the odds are in your favor, is it really worth all this effort to set up a ‘world-class foundation’ now versus doing it later?</li>
<li>Weigh operational burden and complexity. Do you really need multi-region replication for fault-tolerance in return for making your DevOps life 2x more difficult?</li>
<li>Be selfish. Do you want to be woken up in the middle of night because something, somewhere stopped working? There is nothing wrong with learning the new JavaScript framework from last week. Create a new project and publish it on GitHub. Don’t build production systems just because you like it.</li>
<li>Think about people who’d have to live with your mess. Think about your <strong>legacy</strong>. Do people remember you as the guy who built rock-solid systems or someone who left a crazy mess behind?</li>
<li>Share your ideas with experts and veterans and let them <strong>criticize</strong>. Identify people in other teams who you respect and who’d disagree freely.</li>
<li>Don’t jump to conclusions on the results of quick experiments. <em>HelloWorld</em> prototypes of anything are easy. Real life is very different from <em>HelloWorld</em>.</li>
</ul>
<hr />
<p>We discussed YAGNI in this post, but it is generally applied in the context of writing software, design patterns, frameworks, ORMs, etc. Things that are in control of one person.</p>
<blockquote>
<p>YAGNI is coding what you need, as you need, refactoring your way through.</p>
</blockquote>
<p>Back to the guy I interviewed. It’s highly unlikely, even for a small startup, that a software developer in the trenches was allowed to pick Hadoop, Kafka and a microservices architecture. Cargo cult practices usually start from someone higher up the ranks. A tech leader who may be a very smart engineer, but very bad at making rational decisions. Someone who probably spends way too much time reading blogs and tries very hard to keep up with the Amazon or Google way of building software.</p>
<hr />
<p>2017 is almost over. <strong>NoSQL has matured</strong>. MongoDB is on its way out. DynamoDB is actually a very solid product and is maturing really well. <strong>RDBMS systems didn’t die</strong>. One can argue they are actually doing pretty good. StackOverflow is powered by <em>just</em> <a href="https://nickcraver.com/blog/2016/02/17/stack-overflow-the-architecture-2016-edition/" rel="nofollow">4 Microsoft SQL Servers</a>. <a href="https://eng.uber.com/mysql-migration/" rel="nofollow">Uber runs on MySQL</a>.</p>
<hr />
<p>You may have great reasons to use MapReduce or SOA. What matters is <em>how</em> you arrive at your decision. Whether by careful, sane thought or by jumping on the bandwagon and cargo-cult’ing.</p>
<p>As my professor said: “<em>Pick the right tool for the job</em>.” I’ll also add: <strong>don’t build Formula One cars when you need a Corolla</strong>.</p>
Caching Strategies and How to Choose the Right One2017-08-11T00:00:00+00:00https://codeahoy.com/2017/08/11/caching-strategies-and-how-to-choose-the-right-one<p><strong>👉 Read First: <a href="/2022/04/03/cache-invalidation/">A Brief Overview of Caching</a></strong></p>
<p>Caching is one of the easiest ways to increase system performance. Databases can be slow (yes even the NoSQL ones) and as you already know, speed is the name of the game.</p>
<p><img src="https://codeahoy.com/img/blogs/speed-matters.jpg" alt="speed-matters" /></p>
<!--more-->
<p>If done <em>right</em>, caches can reduce response times, decrease load on database, and save costs. There are several strategies and choosing the <em>right</em> one can make a big difference. Your caching strategy depends on the data and <strong>data access patterns</strong>. In other words, how the data is written and read. For example:</p>
<ul>
<li>is the system write heavy and reads less frequently? (e.g. time based logs)</li>
<li>is data written once and read multiple times? (e.g. User Profile)</li>
<li>is data returned always unique? (e.g. search queries)</li>
</ul>
<p>A caching strategy for a Top-10 leaderboard system for mobile games will be very different from that of a service which aggregates and returns user profiles. Choosing the right caching strategy is the key to improving performance. Let’s take a quick look at various caching strategies.</p>
<h2 id="cache-aside">Cache-Aside</h2>
<p>This is perhaps the most commonly used caching approach, at least in the projects that I worked on. The cache sits on the <em>side</em> and the application <strong>directly</strong> talks to both the cache and the database. There is no connection between the cache and the primary database. All operations to cache and the database are handled by the application. This is shown in the figure below.</p>
<p><img src="https://codeahoy.com/img/cache-aside.png" alt="cache-aside" /></p>
<p>Here’s what’s happening:</p>
<ol>
<li>The application first checks the cache.</li>
<li>If the data is found in cache, we’ve <em>cache hit</em>. The data is read and returned to the client.</li>
<li>If the data is <strong>not found</strong> in cache, we’ve <em>cache miss</em>. The application has to do some <strong>extra work</strong>. It queries the database to read the data, returns it to the client and <strong>stores</strong> the data in cache so the subsequent reads for the same data results in a cache hit.</li>
</ol>
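<p>The three steps above can be sketched in a few lines of Java. This is a minimal illustration, not production code: the <code>cache</code> and <code>database</code> maps are hypothetical stand-ins for something like Redis and a real data store, and <code>dbReads</code> exists only to make hits vs. misses observable.</p>

```java
import java.util.HashMap;
import java.util.Map;

// Minimal cache-aside sketch: the application owns both lookups.
class CacheAsideDemo {
    static Map<String, String> cache = new HashMap<>();    // stand-in for Redis/Memcached
    static Map<String, String> database = new HashMap<>(); // stand-in for the primary DB
    static int dbReads = 0;                                // counts cache misses

    static String get(String key) {
        String value = cache.get(key);
        if (value != null) {
            return value;             // cache hit: return directly
        }
        value = database.get(key);    // cache miss: extra work, query the DB
        dbReads++;
        if (value != null) {
            cache.put(key, value);    // populate cache so subsequent reads hit
        }
        return value;
    }
}
```

Calling <code>get("user:1")</code> twice performs exactly one database read; the second call is served entirely from the cache.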
<h4 id="use-cases-pros-and-cons">Use Cases, Pros and Cons</h4>
<p>Cache-aside caches are usually general purpose and work best for <strong>read-heavy workloads</strong>. <em>Memcached</em> and <em>Redis</em> are widely used. Systems using cache-aside are <strong>resilient to cache failures</strong>. If the cache cluster goes down, the system can still operate by going directly to the database. (Although, it doesn’t help much if the cache goes down during peak load. Response times can become terrible and in the worst case, the database can stop working.)</p>
<p>Another benefit is that the data model in the cache can be different from the data model in the database. E.g. the response generated as a result of multiple queries can be stored against some request id.</p>
<p>When cache-aside is used, the most common write strategy is to write data to the database directly. When this happens, cache may become inconsistent with the database. To deal with this, developers generally use time to live (TTL) and continue serving stale data until TTL expires. If data freshness must be guaranteed, developers either <strong><a href="/2022/04/03/cache-invalidation/">invalidate the cache entry</a></strong> or use an appropriate write strategy, as we’ll explore later.</p>
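<p>TTL-based expiry can be sketched by storing an expiry timestamp alongside each cached value. The class below is a hypothetical illustration, not any particular cache’s API: an entry past its deadline is simply treated as a miss, so the next read reloads fresh data from the database.</p>

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical TTL wrapper: entries expire ttlMillis after being written.
class TtlCache {
    private static class Entry {
        final String value;
        final long expiresAt;
        Entry(String value, long expiresAt) { this.value = value; this.expiresAt = expiresAt; }
    }

    private final Map<String, Entry> entries = new HashMap<>();
    private final long ttlMillis;

    TtlCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

    void put(String key, String value) {
        entries.put(key, new Entry(value, System.currentTimeMillis() + ttlMillis));
    }

    String get(String key) {
        Entry e = entries.get(key);
        if (e == null) return null;
        if (System.currentTimeMillis() >= e.expiresAt) {
            entries.remove(key);  // expired: treat as a miss so the caller reloads
            return null;
        }
        return e.value;           // possibly stale, but within the TTL budget
    }
}
```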
<h2 id="read-through-cache">Read-Through Cache</h2>
<p>A read-through cache sits in-line with the database. When there is a cache miss, it loads the missing data from the database, populates the cache, and returns it to the application.</p>
<p><img src="https://codeahoy.com/img/read-through.png" alt="read-through" /></p>
<p>Both cache-aside and read-through strategies load data <strong>lazily</strong>, that is, only when it is first read.</p>
<h4 id="use-cases-pros-and-cons-1">Use Cases, Pros and Cons</h4>
<p>While read-through and cache-aside are very similar, there are at least two key differences:</p>
<ol>
<li>In cache-aside, the application is responsible for fetching data from the database and populating the cache. In read-through, this logic is usually supported by the library or stand-alone cache provider.</li>
<li>Unlike cache-aside, the data model in a read-through cache cannot differ from that of the database.</li>
</ol>
<p>Read-through caches work best for <strong>read-heavy</strong> workloads when the same data is requested many times. For example, a news story. The disadvantage is that when the data is requested the first time, it always results in a cache miss and incurs the extra penalty of loading data into the cache. Developers deal with this by ‘<em>warming</em>’ or ‘pre-heating’ the cache by issuing queries manually. Just like cache-aside, it is also possible for data to become inconsistent between the cache and the database, and the solution lies in the write strategy, as we’ll see next.</p>
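<p>The shape of a read-through cache is “a cache constructed with a loader”: the application only ever talks to the cache, and the cache itself knows how to fetch missing data. A minimal sketch, with a <code>Function</code> standing in for the database query a real provider (e.g. a <code>LoadingCache</code>-style library) would run:</p>

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Read-through sketch: the cache, not the application, loads missing data.
class ReadThroughCache {
    private final Map<String, String> entries = new HashMap<>();
    private final Function<String, String> loader; // stand-in for a DB query

    ReadThroughCache(Function<String, String> loader) { this.loader = loader; }

    String get(String key) {
        // On a miss, computeIfAbsent invokes the loader and stores the result.
        return entries.computeIfAbsent(key, loader);
    }
}
```

The application never touches the database directly; the loader runs exactly once per key until the entry is evicted.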
<h2 id="write-through-cache">Write-Through Cache</h2>
<p>In this write strategy, data is first written to the cache and then to the database. The cache sits in-line with the database and writes always go <em>through</em> the cache to the main database. This helps cache maintain consistency with the main database.</p>
<p><img src="https://codeahoy.com/img/write-through.png" alt="write-through" /></p>
<p>Here’s what happens when an application wants to write data or update a value:</p>
<ol>
<li>The application writes the data directly to the cache.</li>
<li>The cache updates the data in the main database. When the write is complete, both the cache and the database have the same value and the cache always remains consistent.</li>
</ol>
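<p>The two steps can be sketched as follows. The <code>database</code> map is a stand-in for the primary store, and both writes are assumed to be synchronous, which is what keeps the cache and the database consistent.</p>

```java
import java.util.HashMap;
import java.util.Map;

// Write-through sketch: every write goes to the cache, which synchronously
// updates the database before returning. The two can never diverge.
class WriteThroughCache {
    private final Map<String, String> entries = new HashMap<>();
    private final Map<String, String> database; // stand-in for the primary store

    WriteThroughCache(Map<String, String> database) { this.database = database; }

    void put(String key, String value) {
        entries.put(key, value);   // step 1: application writes to the cache
        database.put(key, value);  // step 2: cache writes through to the database
    }

    String get(String key) { return entries.get(key); }
}
```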
<h4 id="use-cases-pros-and-cons-2">Use Cases, Pros and Cons</h4>
<p>On their own, write-through caches don’t seem to do much; in fact, they introduce extra write <strong>latency</strong> because data is written to the cache first and then to the main database (two write operations.) But when paired with read-through caches, we get all the benefits of read-through and we also get a data <strong>consistency</strong> guarantee, freeing us from using cache invalidation (assuming ALL writes to the database go through the cache.)</p>
<p><a href="https://aws.amazon.com/dynamodb/dax/">DynamoDB Accelerator (DAX)</a> is a good example of read-through / write-through cache. It sits inline with DynamoDB and your application. Reads and writes to DynamoDB can be done through DAX. (Side note: If you are planning to use DAX, please make sure you familiarize yourself with <a href="http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DAX.consistency.html">its data consistency model</a> and how it interplays with DynamoDB.)</p>
<h2 id="write-around">Write-Around</h2>
<p>Here, data is written directly to the database and only the data that is read makes its way into the cache.</p>
<h4 id="use-cases-pros-and-cons-3">Use Cases, Pros and Cons</h4>
<p>Write-around can be combined with read-through and provides good performance in situations where data is written once and read less frequently or never. For example, real-time logs or chatroom messages. Likewise, this pattern can be combined with cache-aside as well.</p>
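<p>Write-around is essentially the cache-aside read path with writes sent straight to the database. A minimal sketch (the maps are hypothetical stand-ins, as before):</p>

```java
import java.util.HashMap;
import java.util.Map;

// Write-around sketch: writes bypass the cache; only read data enters it.
class WriteAroundDemo {
    static Map<String, String> cache = new HashMap<>();
    static Map<String, String> database = new HashMap<>();

    static void put(String key, String value) {
        database.put(key, value); // write goes around the cache, straight to the DB
    }

    static String get(String key) {
        // On a miss, pull the value from the database into the cache.
        return cache.computeIfAbsent(key, database::get);
    }
}
```

Data that is written but never read (like most log lines) never occupies cache memory.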
<h2 id="write-back-or-write-behind">Write-Back or Write-Behind</h2>
<p>Here, the application writes data to the cache, which stores it and acknowledges to the application <em>immediately</em>. Then later, the cache writes the data <em>back</em> to the database.</p>
<p>This is very similar to Write-Through but there’s one crucial difference: In Write-Through, the data written to the cache is <strong>synchronously</strong> updated in the main database. In Write-Back, the data written to the cache is <strong>asynchronously</strong> updated in the main database. From the application’s perspective, writes to Write-Back caches are <em>faster</em> because only the cache needs to be updated before returning a response.</p>
<p><img src="https://codeahoy.com/img/write-back.png" alt="write-back" /></p>
<p>This is sometimes called write-behind as well.</p>
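<p>A write-behind cache can be sketched with an in-memory queue drained by a background thread. This toy version persists nothing, which illustrates the failure mode covered below: if the process dies before the flusher catches up, queued writes are lost. (A real implementation would also batch or coalesce writes.)</p>

```java
import java.util.AbstractMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

// Write-behind sketch: put() returns once the cache is updated;
// a background thread flushes the write to the database later.
class WriteBehindCache {
    private final Map<String, String> entries = new ConcurrentHashMap<>();
    private final BlockingQueue<Map.Entry<String, String>> pending = new LinkedBlockingQueue<>();

    WriteBehindCache(Map<String, String> database) {
        Thread flusher = new Thread(() -> {
            try {
                while (true) {
                    Map.Entry<String, String> w = pending.take(); // blocks until a write arrives
                    database.put(w.getKey(), w.getValue());       // asynchronous DB update
                }
            } catch (InterruptedException ignored) { }
        });
        flusher.setDaemon(true);
        flusher.start();
    }

    void put(String key, String value) {
        entries.put(key, value); // fast path: acknowledge after the cache write only
        pending.add(new AbstractMap.SimpleEntry<>(key, value));
    }

    String get(String key) { return entries.get(key); }
}
```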
<h4 id="use-cases-pros-and-cons-4">Use Cases, Pros and Cons</h4>
<p>Write-back caches improve write performance and are good for <strong>write-heavy</strong> workloads. When combined with read-through, they work well for mixed workloads, where the most recently updated and accessed data is always available in cache.</p>
<p>It’s resilient to database failures and can tolerate some database downtime. If batching or coalescing is supported, it can reduce overall writes to the database, which decreases the load and <strong>reduces costs</strong>, if the database provider charges by number of requests e.g. DynamoDB. Keep in mind that <strong>DAX is write-through</strong> so you won’t see any reductions in costs if your application is write heavy. (When I first heard of DAX, this was my first question - DynamoDB can be very expensive, but damn you Amazon.)</p>
<p>Some developers use Redis for both cache-aside and write-back to better absorb spikes during peak load. The main disadvantage is that if there’s a cache failure, the data may be permanently lost.</p>
<p>Most relational database storage engines (e.g. InnoDB) have a write-back cache enabled by default in their internals. Queries are first written to memory and eventually flushed to the disk.</p>
<h3 id="summary">Summary</h3>
<p>In this post, we explored different caching strategies and their pros and cons. In practice, carefully evaluate your goals, understand data access (read/write) patterns and choose the best strategy or a combination.</p>
<p>What happens if you choose wrong? One that doesn’t match your goals or access patterns? You may introduce additional latency, or at the very least, not see the <em>full benefits</em>. For example, if you choose <em>write-through/read-through</em> when you actually should be using <em>write-around/read-through</em> (written data is accessed less frequently), you’ll have useless junk in your cache. Arguably, if the cache is big enough, it may be fine. But in many real-world, high-throughput systems, where memory is never big enough and server costs are a concern, the right strategy matters.</p>
<p>I hope you enjoyed this post. Let me know in the comments section below which type of caching strategies you used in your projects. Until next time.</p>
Basics of Java Garbage Collection2017-08-06T00:00:00+00:00https://codeahoy.com/2017/08/06/basics-of-java-garbage-collection<blockquote>
<ul>
<li><strong>Knock, knock</strong>.</li>
<li>Who’s there?</li>
<li><em>…long GC pause…</em></li>
<li><strong>Java</strong>.</li>
</ul>
</blockquote>
<p><img src="https://codeahoy.com/img/blogs/wiseguy.jpg" alt="wiseguy-eh" class="center-image" /></p>
<p>It’s an old joke from the time when Java was <em>new</em> and <em>slow</em> compared to other languages. Over time, Java became a lot <a href="https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/java-gcc.html"><strong>faster</strong></a>. Today it powers many real-time applications with hundreds of thousands of concurrent users. These days, the <strong>biggest impact on Java’s performance comes from its garbage collection</strong>. Fortunately, in many cases, it can be tweaked and optimized to improve performance.</p>
<!--more-->
<p>For most applications, the default settings of the JVM work fine. But when you start noticing performance issues caused by garbage collection and giving more heap memory isn’t possible, you need to tune and optimize the garbage collection. For most developers, it’s a <strong>chore</strong>. It requires patience, good knowledge of how garbage collection works and an understanding of application’s behavior. This post is a high-level overview of Java’s garbage collection with some examples of troubleshooting performance issues.</p>
<p>Let’s get started.</p>
<p>Java ships with <strong>several</strong> garbage collectors. More specifically, these are different <em>algorithms</em> that run in their own <em>threads</em>. Each works differently and has pros and cons. The most important thing to keep in mind is that all garbage collectors <strong>stop the world</strong>. That is, your application is put on hold or paused, as the garbage is collected and taken out. The main difference among the algorithms is <em>how</em> they stop the world. Some algorithms sit completely <strong>idle until the garbage collection is absolutely needed</strong> and then pause your application for a long period while others <strong>do most of their work concurrently</strong> with your application and thus need a <strong>shorter pause</strong> during the stop-the-world phase. The best algorithm depends on your goals: are you <strong>optimizing for throughput</strong>, where long pauses every now and then are tolerable, or are you <strong>optimizing for low latency</strong> by spreading it out and having short pauses all along?</p>
<p>To <em>enhance</em> the garbage collection process, Java (HotSpot JVM, more accurately) divides up the heap memory into two <em>generations</em>: <strong>Young Generation</strong> and <strong>Old Generation</strong> (also called Tenured). There is also a <em>Permanent Generation</em>, but we won’t cover it in this post.</p>
<p><img src="https://codeahoy.com/img/blogs/hotspot-heap.png" alt="hotspot-heap" /></p>
<p><strong>Young generation</strong> is where <em>young</em> objects live. It’s further subdivided into the following areas:</p>
<ol>
<li>Eden Space</li>
<li>Survivor Space 1</li>
<li>Survivor Space 2</li>
</ol>
<p>By default, <strong>Eden is bigger</strong> than the two survivor spaces combined. On my Mac OS X with 64-bit HotSpot JVM, Eden takes about 76% of all the young generation space. All objects are first created here. When Eden is full, a <strong>minor</strong> garbage collection is triggered. All new objects are quickly inspected to check their eligibility for garbage collection. The ones that are dead, that is, aren’t referenced (ignoring reference strength for this discussion) from other objects, are garbage collected. The <strong>surviving objects are moved to one of the empty ‘survivor spaces’</strong>. Which of the two survivor spaces? To answer this question, let’s discuss survivor spaces.</p>
<p>The reason for having two survivor spaces is to avoid <strong>memory fragmentation</strong>. Imagine if there was just one survivor space. While you are at it, also imagine survivor space as a contiguous array of memory. When young generation GC runs through the array, it identifies dead objects for removal. This would leave holes in memory where objects previously lived and <strong>compaction</strong> will be needed. To avoid compaction, HotSpot JVM just copies all surviving objects from the survivor space to the other (empty) survivor space so that there are no holes or empty spaces. While we are discussing compaction, please note that <em>old generation</em> garbage collectors (with the exception of CMS) perform compaction on the old generation section of the heap memory to defragment it.</p>
<p>In short, minor garbage collections (triggered when Eden is full) <strong>ping-pong</strong> live objects from Eden and one of the survivor spaces (known as the ‘from’ survivor space in logs) <em>to</em> the other (known as the ‘to’ survivor space). This happens until one of the following happens:</p>
<ol>
<li>Objects reach <em>maximum tenuring threshold</em>, in other words, have ping-pong’ed enough times that they aren’t young anymore,</li>
<li>There is no room in the survivor space to receive newly birthed objects (We’ll revisit this later.)</li>
</ol>
<p>When this happens, objects are moved to the old generation. (There could be other conditions but I’m not aware of them.) Let’s try to understand with a real example. Suppose we have the following application that creates a few ‘long-lived objects’ during initialization and many short-lived objects during its operation. (E.g. a web server that allocates short-lived objects for each incoming request.)</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">private</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">createFewLongLivedAndManyShortLivedObjects</span><span class="o">()</span> <span class="o">{</span>
<span class="nc">HashSet</span><span class="o"><</span><span class="nc">Double</span><span class="o">></span> <span class="n">set</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">HashSet</span><span class="o"><</span><span class="nc">Double</span><span class="o">>();</span>
<span class="kt">long</span> <span class="n">l</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
<span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="o">;</span> <span class="n">i</span> <span class="o"><</span> <span class="mi">100</span><span class="o">;</span> <span class="n">i</span><span class="o">++)</span> <span class="o">{</span>
<span class="nc">Double</span> <span class="n">longLivedDouble</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">Double</span><span class="o">(</span><span class="n">l</span><span class="o">++);</span>
<span class="n">set</span><span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="n">longLivedDouble</span><span class="o">);</span> <span class="c1">// add to Set so the objects continue living outside the scope</span>
<span class="o">}</span>
<span class="k">while</span><span class="o">(</span><span class="kc">true</span><span class="o">)</span> <span class="o">{</span> <span class="c1">// Keep creating short-lived objects. Extreme but illustrates the point</span>
<span class="nc">Double</span> <span class="n">shortLivedDouble</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">Double</span><span class="o">(</span><span class="n">l</span><span class="o">++);</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>Let’s enable garbage collection logs and other settings using the following JVM command line arguments:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">-Xmx100m</span> // Allow JVM 100 MB of heap memory
<span class="nt">-XX</span>:-PrintGC // Enable GC Logs
<span class="nt">-XX</span>:+PrintHeapAtGC // Enable GC logs
<span class="nt">-XX</span>:MaxTenuringThreshold<span class="o">=</span>15 // Allow objects to live <span class="k">in </span>the young space longer
<span class="nt">-XX</span>:+UseConcMarkSweepGC // Ignore <span class="k">for </span>now<span class="p">;</span> covered later
<span class="nt">-XX</span>:+UseParNewGC // Ignore <span class="k">for </span>now<span class="p">;</span> covered later
</code></pre></div></div>
<p>The application logs showing the state before and after garbage collection are as follows:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Heap <b>before</b> GC <span class="nv">invocations</span><span class="o">=</span>5 <span class="o">(</span>full 0<span class="o">)</span>:
par new <span class="o">(</span><u>young</u><span class="o">)</span> generation total 30720K, used 28680K
eden space 27328K, <b>100%</b> used
from space 3392K, <b>39%</b> used
to space 3392K, 0% used
concurrent mark-sweep <span class="o">(</span><u>old</u><span class="o">)</span> generation total 68288K, used <b>0K</b> <br/>
Heap <b>after</b> GC <span class="nv">invocations</span><span class="o">=</span>6 <span class="o">(</span>full 0<span class="o">)</span>:
par new generation <span class="o">(</span><u>young</u><span class="o">)</span> total 30720K, used 1751K
eden space 27328K, <b>0%</b> used
from space 3392K, <b>51%</b> used
to space 3392K, 0% used
concurrent mark-sweep <span class="o">(</span><u>old</u><span class="o">)</span> generation total 68288K, used <b>0K</b>
</code></pre></div></div>
<p>From the logs, we can see a few things. The first thing to notice is that there have been 5 minor garbage collections before this one (a total of 6.) Eden was 100% used, which triggered it. One of the survivor spaces is 39% used and as such has some room available. After the garbage collection is over, we can see that Eden went back to 0% and survivor space usage increased to 51%. This means that <strong>live objects from Eden and the first survivor space were moved to the second survivor space</strong> and dead ones were garbage collected. How can we tell that some dead objects were collected? We can see that Eden is much larger than the survivor space (27328K vs 3392K) and since survivor space usage only slightly increased, a large number of objects must have been collected. The <strong>old generation space stayed completely empty</strong> before and after the garbage collection (Recall that the <em>tenuring threshold</em> was set to 15.)</p>
<p>Let’s try <strong>another</strong> experiment. Let’s run an application that is only creating <strong>short-lived</strong> objects in multiple threads. Based on what we’ve discussed so far, <strong>none of these objects should go to the old generation</strong>; minor garbage collection should be able to clean them up.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">private</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">createManyShortLivedObjects</span><span class="o">()</span> <span class="o">{</span>
<span class="kd">final</span> <span class="kt">int</span> <span class="no">NUMBER_OF_THREADS</span> <span class="o">=</span> <span class="mi">100</span><span class="o">;</span>
<span class="kd">final</span> <span class="kt">int</span> <span class="no">NUMBER_OF_OBJECTS_EACH_TIME</span> <span class="o">=</span> <span class="mi">1000000</span><span class="o">;</span>
<span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="o">;</span> <span class="n">i</span><span class="o"><</span><span class="no">NUMBER_OF_THREADS</span><span class="o">;</span> <span class="n">i</span><span class="o">++)</span> <span class="o">{</span>
<span class="k">new</span> <span class="nf">Thread</span><span class="o">(()</span> <span class="o">-></span> <span class="o">{</span>
<span class="k">while</span><span class="o">(</span><span class="kc">true</span><span class="o">)</span> <span class="o">{</span>
<span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="o">;</span> <span class="n">i</span><span class="o"><</span><span class="no">NUMBER_OF_OBJECTS_EACH_TIME</span><span class="o">;</span> <span class="n">i</span><span class="o">++)</span> <span class="o">{</span>
<span class="nc">Double</span> <span class="n">shortLivedDouble</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">Double</span><span class="o">(</span><span class="mf">1.0d</span><span class="o">);</span>
<span class="o">}</span>
<span class="n">sleepMillis</span><span class="o">(</span><span class="mi">1</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="o">}).</span><span class="na">start</span><span class="o">();</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>For this example, I gave the JVM only <strong>10 MB</strong> of memory. Let’s look at the GC logs.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Heap <b>before</b> GC <span class="nv">invocations</span><span class="o">=</span>0 <span class="o">(</span>full 0<span class="o">)</span>:
par new <span class="o">(</span><u>young</u><span class="o">)</span> generation total 3072K, used 2751K
eden space 2752K, 99% used
from space 320K, 0% used
to space 320K, 0% used
concurrent mark-sweep <span class="o">(</span><u>old</u><span class="o">)</span> generation total 6848K, used <b>0K</b> <br/>
Heap <b>after</b> GC <span class="nv">invocations</span><span class="o">=</span>1 <span class="o">(</span>full 0<span class="o">)</span>:
par new generation <span class="o">(</span><u>young</u><span class="o">)</span> total 3072K, used 318K
eden space 2752K, 0% used
from space 320K, 99% used
to space 320K, 0% used
concurrent mark-sweep <span class="o">(</span><u>old</u><span class="o">)</span> generation total 6848K, used <b>76K</b>
</code></pre></div></div>
<p><strong>Not what we predicted</strong>. We can see that this time, the <strong>old generation received objects right after the first minor</strong> garbage collection. We know that these objects are short-lived and tenuring threshold is set to 15 and this is the first collection. What happened is the following: the application created a large number of objects which filled up Eden space. Minor garbage collection ran and tried to collect garbage. However, most of these short-lived objects were <strong>active</strong> during the GC, i.e. were being referenced from a live thread and being processed. The young generation garbage collector had <strong>no choice but to push these objects to the old generation</strong>. This is bad because the objects that got pushed to the old generation were <strong>prematurely aged</strong> and can only be cleaned up by old generation’s major garbage collection which usually takes more time. With a particular GC algorithm that we’ll cover later, <em>CMS</em>, major GC is triggered when the old generation memory is 70% full. This default value can be changed with the <code class="language-plaintext highlighter-rouge">-XX:CMSInitiatingOccupancyFraction=70</code> argument.</p>
<p>How to prevent premature aging of short-lived objects? There are several ways. One theoretical way is to estimate the number of active short-lived objects and size the young generation appropriately. Let us make the following changes:</p>
<ul>
<li>The Young Generation by default is 1/3 of the total heap. Let’s change this using <code class="language-plaintext highlighter-rouge">-XX:NewRatio=1</code>, which gives the young generation more memory (~3.4 MB compared to the 3.0 MB the last time.)</li>
<li>Also increase the survivor space ratio using the <code class="language-plaintext highlighter-rouge">-XX:SurvivorRatio=1</code> argument. (~1.6MB each compared to 0.3 MB the last time.)</li>
</ul>
<p>The problem was fixed. After 8 minor garbage collections, the old generation space was still empty.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Heap <b>before</b> GC <span class="nv">invocations</span><span class="o">=</span>7 <span class="o">(</span>full 0<span class="o">)</span>:
par new generation total 3456K, used 2352K
eden space 1792K, 99% used
from space 1664K, 33% used
to space 1664K, 0% used
concurrent mark-sweep generation total 5120K, used <b>0K</b> <br/>
Heap <b>after</b> GC <span class="nv">invocations</span><span class="o">=</span>8 <span class="o">(</span>full 0<span class="o">)</span>:
par new generation total 3456K, used 560K
eden space 1792K, 0% used
from space 1664K, 33% used
to space 1664K, 0% used
concurrent mark-sweep generation total 5120K, used <b>0K</b>
</code></pre></div></div>
<p>This is in no way an exhaustive method of tuning garbage collection. I’m simply trying to demonstrate the steps involved. For real applications, optimum settings are found as a result of trial and error with different settings. For example, we could have also fixed the problem by <strong>doubling</strong> the total heap memory size.</p>
<h2 id="garbage-collection-algorithms">Garbage Collection Algorithms</h2>
<p>Now that we have covered generations, let’s look at garbage collection algorithms. HotSpot JVM comes with several algorithms for young and old generations. At a high level, there are three general types of collection algorithms, each with its own <a href="https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/collectors.html">performance characteristic</a>:</p>
<blockquote>
<p><strong>serial collector</strong> uses a single thread to perform all garbage collection work, which makes it relatively efficient because there is no communication overhead between threads. It is best suited to single-processor machines (<code class="language-plaintext highlighter-rouge">-XX:+UseSerialGC</code>).</p>
<p><strong>parallel collector</strong> (also known as the throughput collector) performs minor collections in parallel, which can significantly reduce garbage collection overhead. It is intended for applications with medium-sized to large-sized data sets that are run on multiprocessor or multithreaded hardware.</p>
<p><strong>concurrent collector</strong> performs most of its work concurrently (for example, while the application is still running) to keep garbage collection pauses short. It is designed for applications with medium-sized to large-sized data sets in which response time is more important than overall throughput because the techniques used to minimize pauses can reduce application performance.</p>
</blockquote>
<p><img src="https://codeahoy.com/img/blogs/gc-compared.png" alt="gc-compared" /></p>
<p>HotSpot JVM allows you to configure separate GC algorithms for young and old generations. But you can only <strong>pair up compatible</strong> algorithms. For example, you cannot pair up <em>Parallel Scavenge</em> for young generation collector with <em>Concurrent Mark Sweep</em> for old generation collector because they are not compatible. To make it easier for you, I was going to make an infographic to show which garbage collectors are compatible, however, luckily I searched first and found one created by JVM engineer, <a href="https://blogs.oracle.com/jonthecollector/our-collectors">Jon Masamitsu</a>.</p>
<p><img src="https://codeahoy.com/img/blogs/gc-collectors-pairing.jpg" alt="gc-collectors-pairing" /></p>
<blockquote>
<ol>
<li>“Serial” is a stop-the-world, copying collector which uses a single GC thread.</li>
<li><strong>“Parallel Scavenge”</strong> is a stop-the-world, copying collector which uses multiple GC threads.</li>
<li><strong>“ParNew”</strong> is a stop-the-world, copying collector which uses multiple GC threads. It differs from “Parallel Scavenge” in that it has enhancements that make it usable with CMS. For example, “ParNew” does the synchronization needed so that it can run during the concurrent phases of CMS.</li>
<li>“Serial Old” is a stop-the-world, mark-sweep-compact collector that uses a single GC thread.</li>
<li><strong>“CMS”</strong> (Concurrent Mark Sweep) is a mostly concurrent, low-pause collector.</li>
<li><strong>“Parallel Old”</strong> is a compacting collector that uses multiple GC threads.</li>
</ol>
</blockquote>
<p>Concurrent Mark Sweep (<em>CMS</em>), paired with <em>ParNew</em>, works really well for server-side applications processing live requests from clients. I have been using it with ~10 GB of heap memory and it keeps response times steady and GC pauses short. This combination is enabled with <code class="language-plaintext highlighter-rouge">-XX:+UseConcMarkSweepGC</code>, which selects ParNew as the young-generation collector by default. Some developers I know use the Parallel collectors (<em>Parallel Scavenge</em> + <em>Parallel Old</em>) and are happy with the results.</p>
<p>One important thing to know about CMS is that there have been <strong><a href="http://openjdk.java.net/jeps/291">calls to deprecate</a></strong> it, and that will probably happen in Java 9 :’( Oracle recommends using the newer concurrent collector instead: the <a href="http://docs.oracle.com/javase/7/docs/technotes/guides/vm/G1.html">Garbage-First</a> collector, or <strong>G1</strong>, first introduced with Java 7:</p>
<blockquote>
<p>The G1 collector is a server-style garbage collector, targeted for multi-processor machines with large memories. It meets garbage collection (GC) pause time goals with high probability, while achieving high throughput.</p>
</blockquote>
<p><strong>G1</strong> works on both the old and young generations. It is optimized for larger heap sizes (&gt;10 GB). I’ve not experienced the G1 collector first-hand and developers in my team are still using CMS, so I can’t yet compare the two. A quick online search reveals benchmarks showing <a href="http://blog.novatec-gmbh.de/g1-action-better-cms/">CMS outperforming</a> <a href="https://dzone.com/articles/g1-vs-cms-vs-parallel-gc">G1</a>. I’d tread carefully, but G1 should be fine. It can be enabled with:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">-XX</span>:+UseG1GC
</code></pre></div></div>
<p>Hope you found this post useful. Until next time.</p>
Message Batching to Increase Throughput and Reduce Costs2017-08-03T00:00:00+00:00https://codeahoy.com/2017/08/03/message-batching-to-increase-throughput-and-reduce-costs<p>A while ago, I was working on a backend system on the AWS cloud. Individual services in the system communicated by exchanging <em>asynchronous messages</em> with each other using <a href="https://aws.amazon.com/sqs/">Amazon SQS</a>. During the early stages of development, we ran small load tests and found that CPU use was high and we would need more servers to handle the load. (Estimated peak load was 50,000 requests per second.) Services were generating lots of small messages every second, and the profiler showed that the threads responsible for handling and sending messages to SQS, <strong>one message at a time</strong>, were using more CPU than they should. This was affecting performance and throughput.</p>
<!--more-->
<p>To overcome this challenge, we took a page from Amazon’s <a href="http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/throughput.html">best practices advice</a>, and introduced <strong>batching and buffering</strong> on senders to group multiple messages and send as one batch. This simple fix <strong>increased the throughput</strong>, <strong>cut down SQS costs</strong> (SQS <a href="https://aws.amazon.com/sqs/pricing/">pricing</a> is by <em>number</em> of messages), and had <strong>very little impact on latency</strong>. <em>Win, win, win</em>. The <em>custom</em> batching algorithm was simple: batch up to 15 messages or wait up to a maximum of 50 milliseconds. If either 15 messages arrive quickly to form a batch or 50 milliseconds elapse, the batch is sent out. These figures (batch size and maximum time) were established after trial and error, tuned for the best throughput and latency. The application was multi-threaded with hundreds of active threads, so the system allowed <strong>multiple concurrent</strong> batches to be active at the same time to reduce thread blocking.</p>
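<p>The batching policy described above (flush at 15 messages, or after 50 milliseconds, whichever comes first) can be sketched in a few lines of Python. This is an illustrative, simplified version: the class name and the pluggable <code class="language-plaintext highlighter-rouge">send</code> callable are my own, flushing is driven by explicit <code class="language-plaintext highlighter-rouge">poll()</code> calls rather than a timer thread, and a real SQS sender would also have to respect the service’s own limit of 10 messages per <code class="language-plaintext highlighter-rouge">SendMessageBatch</code> call.</p>

```python
import threading
import time


class MessageBatcher:
    """Groups messages into batches of up to max_batch messages,
    flushing early if max_wait seconds elapse first. `send` is any
    callable that accepts a list of messages (e.g. a wrapper that
    calls SQS in a batch)."""

    def __init__(self, send, max_batch=15, max_wait=0.050):
        self.send = send
        self.max_batch = max_batch
        self.max_wait = max_wait
        self._buffer = []
        self._lock = threading.Lock()
        self._deadline = None

    def add(self, message):
        with self._lock:
            if not self._buffer:
                # First message of a new batch starts the 50 ms clock.
                self._deadline = time.monotonic() + self.max_wait
            self._buffer.append(message)
            if len(self._buffer) >= self.max_batch:
                self._flush()

    def poll(self):
        # Call periodically (in production, from a timer thread)
        # to flush batches whose deadline has passed.
        with self._lock:
            if self._buffer and time.monotonic() >= self._deadline:
                self._flush()

    def _flush(self):
        batch, self._buffer = self._buffer, []
        self.send(batch)
```

<p>The original system allowed multiple concurrent batches; a sketch like this would be instantiated per sender thread or sharded to get the same effect.</p>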
<p>The only downside is that if the application crashes after retrieving a batch from the queue but <strong>before processing</strong> the messages in it, we can <strong>lose</strong> some or all of them. To deal with message loss, the system can implement <strong>retries</strong> and resend messages upon timeout. Clients will see 2x response times, but at least they get their responses back.</p>
<p>If you are building applications which <em>generate</em> and exchange a <strong>lot of messages</strong> in <strong>short periods</strong>, batching can increase <em>throughput</em>, <em>performance</em> and in the case of SQS, <em>reduce costs</em>. I recommend using message batching for SQS even if only for cost reasons.</p>
<p>Until next time.</p>
Amazon DynamoDB Auto Scaling2017-07-29T00:00:00+00:00https://codeahoy.com/2017/07/29/at-last-amazon-adds-dynamoDB-auto-scaling<p>Amazon DynamoDB supports <strong><a href="https://aws.amazon.com/about-aws/whats-new/2017/06/announcing-amazon-dynamodb-auto-scaling/">Auto Scaling</a></strong> which is a fantastic feature.</p>
<p>When enabled, Auto Scaling adjusts <strong>read and write capacities</strong> of DynamoDB tables (and global secondary indexes) automatically <em>based on the demand</em>. If you haven’t used DynamoDB before, you might be wondering why is this important? Before Auto Scaling, the users were required to provide <strong>fixed capacities</strong> for their tables. These capacities were static and didn’t respond to traffic demands. This was problematic because:</p>
<ol>
<li>The application performance and high-availability was compromised whenever the utilization exceeded the provisioned throughput. When this happened, DynamoDB throttled requests, which resulted in loss of data or poor user experience.</li>
<li>Cost control was poor at best. DynamoDB charges you by how much you provision. You end up paying the <strong>full cost</strong> of provisioned amounts, even if you <strong>use less</strong> than what you provisioned. In other words, if you <strong>overprovision</strong> for peak load, you’ll pay extra during non-peak hours, when the capacity isn’t being fully utilized. On the other hand, if you <strong>underprovision</strong>, the performance of your application will suffer due to <strong>throttling</strong> when the load exceeds the provisioned capacity.</li>
</ol>
<!--more-->
<p>Many real-world use cases are difficult to predict in advance, and traffic fluctuations are both prevalent and hard to deal with. In mobile gaming, traffic can increase suddenly if Apple or Google features the game, or the publisher runs a massive ad campaign. A website can suddenly see a large number of visitors if an article is picked up by a major newspaper or news aggregator site.</p>
<ul id="markdown-toc">
<li><a href="#what-is-dynamodb-auto-scaling" id="markdown-toc-what-is-dynamodb-auto-scaling">What is DynamoDB Auto Scaling?</a> <ul>
<li><a href="#historical-perspective" id="markdown-toc-historical-perspective">Historical Perspective</a></li>
<li><a href="#auto-scale-target-utilization" id="markdown-toc-auto-scale-target-utilization">Auto Scale Target Utilization</a></li>
</ul>
</li>
<li><a href="#dynamodb-auto-scaling-pricing" id="markdown-toc-dynamodb-auto-scaling-pricing">DynamoDB Auto Scaling Pricing</a></li>
<li><a href="#dynamodb-auto-scaling-vs-on-demand" id="markdown-toc-dynamodb-auto-scaling-vs-on-demand">DynamoDB Auto Scaling vs On Demand</a> <ul>
<li><a href="#when-to-use-auto-scaling-vs-on-demand" id="markdown-toc-when-to-use-auto-scaling-vs-on-demand">When to use Auto Scaling vs On Demand?</a></li>
</ul>
</li>
<li><a href="#dynamodb-and-dax" id="markdown-toc-dynamodb-and-dax">DynamoDB and DAX</a></li>
</ul>
<h2 id="what-is-dynamodb-auto-scaling">What is DynamoDB Auto Scaling?</h2>
<p>DynamoDB Auto Scaling feature lets you automatically manage throughput in response to your traffic patterns without throttling your users. You can assign minimum and maximum provisioned capacities to a table (or Global Secondary Index). When the traffic goes up, the table will increase its provisioned read and write capacity. When the traffic goes down, it will decrease the throughput so that you don’t pay for unused provisioned capacity.</p>
<h3 id="historical-perspective">Historical Perspective</h3>
<p>Before the Auto Scaling feature was introduced in 2017 (fun fact: I was at the re:Invent where they announced it), it was challenging for Engineering and DevOps teams to keep up with manually adjusting provisioning in response to external events. I worked in a team where this was a huge challenge. If you overprovision, you can end up paying thousands of dollars for unused capacity. People came up with custom solutions like Lambda scripts, sometimes even adding a cache to reduce calls to DynamoDB during peak load (caching has other uses as well.) Then there were third-party solutions. <a href="https://github.com/sebdah/dynamic-dynamodb">Dynamic DynamoDB</a> was arguably the most widely used. It’s open source and has additional features such as time-based auto scaling and <a href="https://dynamic-dynamodb.readthedocs.io/en/latest/granular_scaling.html">granular</a> control over how many read and write capacity units to add or remove during scale-up or scale-down. However, now that this feature is built into DynamoDB and enabled by default, I doubt that third-party solutions will be used very much, if at all. <em>So long and thanks for all the fish</em>.</p>
<p>Auto Scaling is an optional feature and you must enable it explicitly using the ‘Auto Scaling’ settings in the AWS Console, as shown in the image below. You may also enable this using <a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/AutoScaling.HowTo.SDK.html" rel="nofollow">AWS SDK</a>.</p>
<p><img src="https://codeahoy.com/img/blogs/dynamodb-autoscale.png" alt="Setting Dynamodb auto scale settings in AWS console" height="253px" width="700px" /></p>
<h3 id="auto-scale-target-utilization">Auto Scale Target Utilization</h3>
<p>In the image above, I’m specifying that the table utilization (<strong>target utilization</strong>) is kept at 80%. If the utilization exceeds this threshold, auto-scaling will increase the provisioned read capacity. Likewise, if the utilization falls below 80%, it will decrease the capacity to reduce unnecessary costs. In other words, auto-scaling uses an algorithm to adjust the provisioned throughput of the table so that the actual utilization remains at or near your target utilization.</p>
<p>Target utilization is useful for dealing with sudden increases in traffic. Why? Because auto-scaling doesn’t kick in and take effect instantly; it takes time (roughly 5 to 10 minutes). If you set target utilization to a very high number, e.g. 100%, the benefit is that you’d only ever pay for used capacity. The drawback is that if the demand on the table increases suddenly, you’ll likely see throttling errors while DynamoDB auto scaling adjusts capacity to match demand, once you run out of your <a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-partition-key-design.html#bp-partition-key-throughput-bursting" rel="nofollow">burst capacity</a> quota. As a general rule of thumb, 70%-75% is a good number.</p>
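<p>The relationship between consumed capacity and the capacity auto scaling aims to provision is simple division. To be clear, this is just the arithmetic behind the target, not DynamoDB’s actual scaling algorithm, and the function name is my own:</p>

```python
def provisioned_for_target(consumed_units, target_utilization=0.75):
    """Capacity needed so that consumed / provisioned stays at the
    target utilization (e.g. 300 consumed units at a 75% target
    call for 400 provisioned units)."""
    return consumed_units / target_utilization
```

<p>At a 100% target you would provision exactly what you consume, leaving no headroom for a sudden spike, which is why 70%-75% is the safer choice.</p>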
<h2 id="dynamodb-auto-scaling-pricing">DynamoDB Auto Scaling Pricing</h2>
<p>There is no additional cost to use DynamoDB Auto Scaling. You only pay for the provisioned capacity, which auto scaling adjusts for you in response to traffic patterns. Done right, auto scaling will reduce costs and ensure you don’t pay for unused capacity. One important thing to keep in mind is not to set the maximum provisioned capacity too high, because if you do, you risk scaling all the way up to that capacity and being responsible for the resulting costs. Estimate the maximum cost you could bear, and then set the maximum capacity accordingly.</p>
<h2 id="dynamodb-auto-scaling-vs-on-demand">DynamoDB Auto Scaling vs On Demand</h2>
<p>DynamoDB On-Demand is a related feature for managing throughput of a table. With On-Demand, you do not do any capacity planning or provisioning. You don’t specify read or write capacities anywhere. None, Nada, Zilch. Instead, you only pay for what you use. On-Demand was introduced in 2018, a year after Auto Scaling was launched.</p>
<p>On-Demand sounds too good to be true. If you think about it, why would anyone ever use Auto Scaling if On-Demand is available? Is there a catch?</p>
<p>Yes, there’s a catch. It’s <strong>cost</strong>. On-Demand is more expensive than Auto Scaling (provisioned capacity.)</p>
<ul>
<li>On-Demand: Cost for 1 million write requests is <strong>$1.25</strong></li>
<li>Provisioned Capacity: Cost for 1 million write requests is roughly <strong>$0.20</strong></li>
</ul>
<p>On-Demand is about 6 times more expensive than provisioned capacity.</p>
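<p>The “about 6 times” figure comes straight from the two per-million prices above (the provisioned figure is approximate, so treat the ratio as a rough one):</p>

```python
on_demand_cost = 1.25      # USD per 1 million write requests (On-Demand)
provisioned_cost = 0.20    # USD per 1 million write requests (approximate)

ratio = on_demand_cost / provisioned_cost   # 6.25, i.e. "about 6 times"
```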
<h3 id="when-to-use-auto-scaling-vs-on-demand">When to use Auto Scaling vs On Demand?</h3>
<p>Auto Scaling works great when you have predictable traffic patterns, or when you can approximate them based on historical data. The other reason for using provisioned capacity is cost. At my last company, DynamoDB was our main database and cost control was a constant challenge. As much as I loved On-Demand, I couldn’t justify the 6x cost. Besides, Auto Scaling works really well in most scenarios. The only time we had issues was when there was a sudden traffic spike (an in-game event), but that was a problem across the infrastructure, not just DynamoDB. Another downside of On-Demand is the unexpected bill due to a huge traffic increase. I’m not aware of a way to control this at the moment.</p>
<p>On the other hand, if the traffic volume is low and the traffic patterns are unknown, On-Demand might make perfect sense. For example, if you estimate doing a <strong>max</strong> of 1 million requests a month, $1.25 vs. roughly $0.20 a month doesn’t really make that much of a difference.</p>
<h2 id="dynamodb-and-dax">DynamoDB and DAX</h2>
<p><a href="https://twitter.com/jeffbarr">Jeff Barr</a> wrote a <a href="https://aws.amazon.com/blogs/aws/new-auto-scaling-for-amazon-dynamodb/">blog post</a> introducing the new feature and has screenshots of CloudWatch graphs. Check it out if you want to learn more. One key takeaway from his blog is that auto scaling isn’t ideal for <strong>bursts</strong>:</p>
<blockquote>
<p>DynamoDB Auto Scaling is designed to accommodate request rates that vary in a somewhat predictable, generally periodic fashion. If you need to accommodate unpredictable bursts of reading activity, you should use Auto Scaling in combination with <a href="https://aws.amazon.com/dynamodb/dax/">DAX</a> (read <a href="https://aws.amazon.com/blogs/aws/amazon-dynamodb-accelerator-dax-in-memory-caching-for-read-intensive-workloads/">Amazon DynamoDB Accelerator (DAX) – In-Memory Caching for Read-Intensive Workloads</a> to learn more).</p>
</blockquote>
<p>If you enjoyed this post, please share using the social sharing icons below. It’ll help CodeAhoy grow and every share counts. Thank you!</p>
AI Is Not Magic. How Neural Networks Learn2017-07-28T00:00:00+00:00https://codeahoy.com/2017/07/28/ai-is-not-magic-how-neural-networks-learn<p>In my previous <a href="https://codeahoy.com/2017/07/27/ai-winter-is-coming/">blog post</a>, I claimed that “<em>AI is not magic.</em>” In this post, my goal is to discuss how neural networks learn, and show that AI isn’t a crystal ball or magic, just science and some very slick mathematics. I’ll keep this very <em>high level</em>.</p>
<p>Let’s start with a <em>hypothetical scenario</em>. Suppose we are building an app to <a href="https://www.theverge.com/2017/6/26/15876006/hot-dog-app-android-silicon-valley">identify hot dogs</a>. Take a picture and the app will tell you if it’s a hotdog or not. Total App Store domination.</p>
<p><img src="https://codeahoy.com/img/blogs/nothodogapp.jpg" alt="hotdog" /></p>
<!--more-->
<p>To recognize images, we choose to implement a popular machine learning algorithm called the <em>neural network</em>. (In this hypothetical scenario) This decision was made after reading an online article which talked about how neural networks can learn to recognize objects by training on lots of labelled examples. Once trained, it can start identifying images it has never seen before. We go ahead and obtain a <strong>training set</strong> of <em>6000</em> images gathered from online sources. <em>1000</em> images of different types of hotdogs: New York vs Chicago, ketchup, no ketchup, hotdogs on a grill, etc. The other <em>5000</em> images are of various non-hotdog objects: shoes, hamburgers, burrito, human legs. Now all that remains is to build our neural network.</p>
<p>To understand neural networks, we must first understand its elementary building block: the artificial neuron.</p>
<blockquote>
<p>An artificial neuron takes <strong>one or more inputs</strong> and produces a <strong>single output</strong>.</p>
</blockquote>
<p><img src="https://codeahoy.com/img/blogs/perceptron.png" alt="perceptron" /></p>
<p>Looks familiar? It looks a lot like <strong><a href="https://en.wikipedia.org/wiki/Logic_gate">logic gates</a></strong>, which are elementary building blocks of digital circuits. The similarity ends there. Unlike logic gates, neurons can have several inputs and can <strong>change output</strong> for the same input values. This is possible because neurons have <strong>weights</strong> associated with each input, which are multiplied with the input values. These weights allow neurons to rate how important each input is. For example, if the second input to a neuron isn’t very important, the neuron can assign it a weight close to 0, essentially cancelling it out. Neurons also have <strong>biases</strong>, which control how easy it is to get the neuron to output or <em>fire</em>. If the bias is huge, the neuron will fire very easily. I may have oversimplified, but what I have just described is the most basic type of neuron, called the <strong>perceptron</strong>. In the real world, we use more complex types of neurons.</p>
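<p>A perceptron is small enough to write out in full. This minimal sketch uses a step activation; the specific weights, inputs and bias values are made up for illustration:</p>

```python
def perceptron(inputs, weights, bias):
    """Weighted sum of inputs plus a bias; the neuron 'fires' (outputs 1)
    if the total is positive and stays quiet (outputs 0) otherwise."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

# A weight near 0 essentially cancels the second input out:
fires = perceptron([1, 1], weights=[2.0, 0.01], bias=-1.0)        # 1
# A huge bias makes the neuron fire even when every input is 0:
fires_easily = perceptron([0, 0], weights=[2.0, 0.01], bias=5.0)  # 1
```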
<p>An individual neuron is nifty, but it is <strong>not enough for sophisticated decision making</strong>. We want to group a bunch of these neurons together to form a <strong>neural network</strong> where different neurons get <em>tuned</em> to different aspects of the image. For example, a subset of neurons may only fire when they detect a sliced bun, while others may fire when they detect a sausage. All these sub-decisions are weighed in the output layer before the final judgement is passed. For example, if the sausage neurons aren’t firing but the sliced-bun ones are, the output layer will classify the image as <em>not</em> hotdog, because it could be something else with a sliced bun, e.g. a Philly cheesesteak sandwich. The output layer will assign more weight to the input coming from the sausage neurons because there’s a very high chance that the image is of a hotdog when the sausage neurons are firing.</p>
<blockquote>
<p>When we connect bunch of neurons together, we get a <strong>neural network</strong>.</p>
</blockquote>
<p><img src="https://codeahoy.com/img/blogs/neural_network.svg.png" alt="neural_network" /></p>
<p>This particular type of neural network is called feedforward because information just flows in one direction and there are no cycles. Let’s also quickly talk about how we’ll input images to our neural network. Suppose that all images are the same resolution, say 128 by 128 pixels. We represent each image as a 2d array where each element of the array contains color information for the corresponding pixel. This 2d array of pixel color values is fed to the input layer of our neural network which contains <code class="language-plaintext highlighter-rouge">128*128=16384</code> neurons.</p>
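<p>Concretely, feeding an image to the input layer amounts to flattening the 2d pixel array into one long list. For simplicity this sketch assumes a single intensity value per pixel; a color image would carry, e.g., three values per pixel:</p>

```python
width = height = 128
# A 128x128 image as a 2d array of pixel values (all zeros as a stand-in).
image = [[0 for _ in range(width)] for _ in range(height)]

# Flatten row by row; each pixel value feeds one input neuron.
input_layer = [pixel for row in image for pixel in row]
len(input_layer)   # 128 * 128 = 16384 input neurons
```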
<p>Back to weights and biases. The ability to change output for the same input values is arguably the most important feature of an artificial neuron (and by extension, the neural network) and it is the <strong>key to its learning</strong>. The goal of learning is to <em>find</em> the best <strong>combination of weights and biases</strong> for the neural network. Let’s express this objective formally so we can measure it and give it a name: <strong>cost function</strong>:</p>
<blockquote>
<p>Cost Function (weights,biases) <strong>=</strong> (# of images incorrectly identified) <strong>÷</strong> (Total # of images)</p>
</blockquote>
<p>Great. Now we can measure the performance with a clear objective: <strong>find weights and biases which minimize the cost function</strong>. How do we find the <em>best</em> weights and biases? One way is to just pick them <strong>randomly</strong>, run the neural network over the entire training data (6,000 images) and calculate the cost function, which is the ratio of images incorrectly identified to the total number of images. Keep repeating until we find a value of the cost function low enough for our liking. Let’s say we want to stop when the neural network has a success rate of 99%. When we reach this goal, the combination of weights and biases <em>reflects some similarity between hotdogs</em> and our neural network can identify never-before-seen hotdog images. Let’s try it out. To keep it simple, suppose we are only doing this for 2 weights and biases.</p>
<p><strong>Iteration 1</strong>: Weight1 = 1, Weight2 = 3, Bias1 = 1, Bias2 = 4. Let’s say it correctly classifies 300 out of 500 images of hot dog correctly. We’ll say the error rate is 200/500 = 40%</p>
<p><strong>Iteration 2</strong>: Weight1 = 14, Weight2 = 6, Bias1 = 3, Bias2 = 0.2. This time it classifies 250 out of 500 images of hot dog correctly. The error rate is 250/500 = 50%, worse than last time.</p>
<p><strong>Iteration 3</strong>: Choose weights and biases randomly again, roll the dice and error rate is 35%. Hmmm…</p>
<p>…</p>
<p>I hope you can see why this approach won’t work. There is no science to it, not even heuristics. If luck isn’t on our side, we could keep iterating until the end of time. And this is with just 2 weights and biases. In reality, it’s not uncommon for neural networks to have hundreds or even thousands of weights and biases. The shot-in-the-dark approach isn’t just impractical; it’s impossible.</p>
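<p>The random search the iterations above walk through looks like this in code. The toy cost function and the sampling range are made up; the point is only that blind sampling gives no guarantee of progress from one iteration to the next:</p>

```python
import random


def random_search(cost, n_params, iterations=1000, seed=0):
    """Blindly sample parameter vectors and keep the best one seen.
    `cost` maps a list of parameters to a number we want to minimize."""
    rng = random.Random(seed)
    best_params, best_cost = None, float("inf")
    for _ in range(iterations):
        params = [rng.uniform(-10, 10) for _ in range(n_params)]
        c = cost(params)
        if c < best_cost:
            best_params, best_cost = params, c
    return best_params, best_cost

# Toy cost with its minimum (0) at (3, -2): even here, 1000 blind guesses
# only get us *near* the minimum, and every extra parameter makes it
# exponentially harder.
params, cost_value = random_search(
    lambda p: (p[0] - 3) ** 2 + (p[1] + 2) ** 2, n_params=2
)
```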
<p><img src="https://codeahoy.com/img/blogs/deep_neural_network.png" alt="deep_neural_network" /></p>
<p>But thanks to mathematics and <strong>calculus</strong>, we have a way of finding optimum weights and biases quicker, or in other words, minimizing the cost function in a scientific manner. This is done using an algorithm called ‘Gradient Descent.’ The first time it runs, the algorithm picks random weights and biases, but in <strong>subsequent iterations it doesn’t choose randomly; it updates them in a calculated manner to minimize the cost function even further</strong>. It keeps iterating until it finds the minimum value of the cost function. At this point, our neural network is said to have been ‘trained’ and it can start classifying pictures that it hasn’t seen before.</p>
<p>Of course, there is much, much more happening under the hood. If you are interested, Andrew Ng has an <a href="https://www.youtube.com/watch?v=yFPLyDwVifc">excellent lecture</a> on gradient descent and I encourage you to watch the video to understand it in more detail.</p>
<p>Let’s revisit our cost function:</p>
<blockquote>
<p>Cost Function (weights,biases) <strong>=</strong> (# of images incorrectly identified) <strong>÷</strong> (Total # of images)</p>
</blockquote>
<p>This cost function is too simple and isn’t practical for gradient descent. Gradient descent cannot make incremental updates to weights and biases to minimize it. We need a <strong>smooth</strong> cost function. In practice, people usually use a quadratic cost function also known as the <em>mean squared error</em> (MSE).</p>
<p><img src="https://codeahoy.com/img/blogs/mse.svg" alt="mse" /></p>
<p>Here <strong>Ŷ</strong> is the prediction of the learning algorithm, <strong>Y</strong> is the actual value and <strong>n</strong> is the size of the training data. It’s a measure of how close predictions are to actual values. Data scientists often multiply the cost function by 1/2 because that way the squared term is easier to cancel when taking the derivative of the function. Speaking of derivatives, gradient descent minimizes the cost function by taking its <a href="https://en.wikipedia.org/wiki/Derivative">derivative</a> (the slope at that point) and moving in the direction where it is decreasing.</p>
<p><img src="https://codeahoy.com/img/blogs/gradient_descent.png" alt="gradient_descent" /></p>
<p>Unlike the first simplified cost function we created, MSE is a <strong>smooth</strong> cost function. In each iteration, gradient descent updates weights and biases to move downhill to the lowest point where the cost function is minimum. As a side note, there are <a href="https://stats.stackexchange.com/questions/154879/a-list-of-cost-functions-used-in-neural-networks-alongside-applications">many types of cost functions</a> but MSE works well in many cases.</p>
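<p>Here is a minimal sketch of gradient descent on the MSE cost for the simplest possible model, a line <code class="language-plaintext highlighter-rouge">y = w*x + b</code>. This is a stand-in for a network’s many weights and biases (real networks compute these gradients with backpropagation), and the data points are made up:</p>

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=500):
    """Fit y ~ w*x + b by gradient descent on the MSE cost
    C(w, b) = (1/n) * sum((w*x + b - y)^2)."""
    w = b = 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of the MSE with respect to w and b.
        grad_w = (2 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = (2 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
        # Step downhill, scaled by the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Points on the line y = 2x + 1; gradient descent recovers w ~ 2, b ~ 1.
w, b = gradient_descent([0, 1, 2, 3], [1, 3, 5, 7])
```

<p>Note the <code class="language-plaintext highlighter-rouge">learning_rate</code> parameter scaling each step; it is the knob discussed next.</p>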
<p>There is one more concept that you should know: the <strong>learning rate</strong>. The learning rate is the amount by which gradient descent updates weights and biases in each step. If the learning rate is too low, the algorithm may take a very long time to find the minimum. If the learning rate is too high, the algorithm may overshoot and miss the minimum (i.e. jump to the other side of the hill.) In practice, several parameters like the learning rate, and others that we’ll see in later posts, need to be adjusted to get the best results and performance.</p>
<p>Gradient descent, or rather its variations and several optimizations (as we’ll see in later posts), remains in wide use in machine learning algorithms like linear regression and neural networks. In linear regression, it gives us the line that best fits the data we are modeling; in neural networks, it gives us the best weights and biases.</p>
<p><em>Is AI magic?</em> Magic of matrix multiplication and gradient descent, sure. But not smart enough to <a href="https://www.extremetech.com/extreme/252781-elon-musk-warns-us-ai-destroy-world">take over and destroy the world</a>. Many CEOs and executives <strong>overestimate</strong> the power of AI because of the unrealistic picture painted by the media. AI is extremely powerful and is proving itself with positive ROI in many domains, but it also has its <em>limitations</em> and it is not an off-the-shelf crystal ball. You should understand what AI can and cannot do and then incorporate it into your overall strategy. A good <a href="https://hbr.org/2016/11/what-artificial-intelligence-can-and-cant-do-right-now">rule of thumb</a>:</p>
<blockquote>
<p>If a typical <strong>person can do a mental task with less than one second of thought</strong>, we can probably <strong>automate it using AI</strong> either now or in the near future. (-Andrew Ng)</p>
</blockquote>
<p><img src="https://codeahoy.com/img/blogs/machinelearningcansandcannots.png" alt="machinelearningcansandcannots" /></p>
<p>See you next time.</p>
AI Winter is Coming?2017-07-27T00:00:00+00:00https://codeahoy.com/2017/07/27/ai-winter-is-coming<h2 id="what-is-ai-winter">What is AI Winter?</h2>
<p><a href="https://en.wikipedia.org/wiki/AI_winter">AI winter</a> is a period of ‘reduced funding and interest in the field of artificial intelligence.’ AI winters are preceded by hype cycles and ambitious claims of what AI can do. Money into research and AI companies pours in and expectations are inflated. But it doesn’t last and after a while, pessimism takes over the community and spreads to press, investors and government. Budgets are slashed, funding is stopped and AI research virtually dries up. There have been two AI winters: first one in the 1970’s and the last one was in the 1980’s.</p>
<!--more-->
<h2 id="is-another-ai-winter-coming">Is Another AI Winter Coming?</h2>
<p>We are there again as far as the hype is concerned. There has been no shortage of buzz around AI in the past few years. Everyone, everywhere is talking about AI and how it can predict revenues, increase sales, and create chatbots that can hold natural language conversations just like real customer service people.</p>
<p>Is the boom going to end soon? <a href="https://twitter.com/AndrewYNg">Andrew Ng</a>, chief scientist at Baidu Research and a prominent figure in the AI community, <a href="https://www.technologyreview.com/s/603062/ai-winter-isnt-coming/">doesn’t think so</a>. The advancements in computing power and the availability of huge amounts of training data are providing “the fuel required to make emerging AI techniques feasible.”</p>
<blockquote>
<p>“There are multiple experiments I’d love to run if only we had a 10-x increase in performance,” Ng adds. For instance, he says, instead of having various different image-processing algorithms, greater computer power might make it possible to build a single algorithm capable of doing all sorts of image-related tasks.</p>
</blockquote>
<p>Addressing concerns about hype, Andrew Ng says:</p>
<blockquote>
<p>“There’s definitely hype,” adds Ng, “but I think there’s such a strong underlying driver of real value that it won’t crash like it did in previous years.”</p>
</blockquote>
<p>Andrew Ng has good reasons to be optimistic. So let’s look at the other side of the coin and some recent AI misses. Consider the case of <a href="https://www.ibm.com/watson/">IBM Watson</a> and claims that it’s going to <a href="https://www.forbes.com/sites/matthewherper/2017/02/19/md-anderson-benches-ibm-watson-in-setback-for-artificial-intelligence-in-medicine/#72856f203774">eradicate cancer</a>:</p>
<blockquote>
<p>It was one of those amazing “we’re living in the future” moments. In an October 2013 press release, IBM declared that MD Anderson, the cancer center that is part of the University of Texas, “is using the <strong>IBM Watson cognitive computing system for its mission to eradicate cancer</strong>.”</p>
<p>Well, now that future is past. The partnership between IBM and one of the world’s top cancer research institutions is falling apart. <strong>The project is on hold</strong>, MD Anderson confirms, and has been since late last year. MD Anderson is actively requesting bids from other contractors who might replace IBM in future efforts. And a scathing report from auditors at the University of Texas says <strong>the project cost MD Anderson more than $62 million and yet did not meet its goals</strong>.</p>
</blockquote>
<p>In one of the many Watson ad campaigns IBM ran, Watson tells Bob Dylan that he has read all his lyrics and that the meaning of Dylan’s music is all about ‘<em>time passing and love fading</em>.’</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/hVZeR-RmhcM" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<p>Did Watson get it right? Roger Schank <a href="http://www.rogerschank.com/fraudulent-claims-made-by-IBM-about-Watson-and-AI">disagrees</a>:</p>
<blockquote>
<p>Really? I am a child of the 60s’ and I remember Dylan’s songs well enough. Ask anyone from that era about Bob Dylan and no one will tell you his main theme was “love fades”. He was a protest singer, and a singer about the hard knocks of life. He was part of the anti-war movement. Love fades? That would be a dumb computer counting words. How would Watson see that many of Dylan’s songs were part of the anti-war movement? Does he say anti-war a lot? He probably never said it in a song.</p>
</blockquote>
<p>IBM Watson is a good product and I won’t be too harsh on it. Its ads and marketing efforts got lots of people and companies interested in AI. A colleague of mine told me that his CEO rolled his eyes (metaphorically speaking) when, a few years ago, he suggested using machine learning to make predictions about customer behavior and revenue. While he will never know for sure whether it was because of IBM’s PR, the CEO changed his mind recently and now thinks of AI as a crystal ball that will transform his business and bring in millions and millions of dollars in revenue.</p>
<p>AI is not magic. It’s just science and mathematics.</p>
<p>I used Watson to build a gaming chatbot that could hold conversations with users on a small number of topics. Watson uses <a href="https://en.wikipedia.org/wiki/Natural_language_processing">Natural Language Processing</a> and machine learning algorithms to understand user messages and requires lots of training. We weren’t able to get it to converse at the level of a 2- or 3-year-old, and it was easily confused. In the end, we decided to use a cloud-based API called <a href="https://wit.ai/">Wit.ai</a>, which gave us pretty much the same results as Watson.</p>
<p>AI chatbots replacing humans to provide customer service is a long, long shot. That doesn’t mean that chatbots or AI aren’t useful. Even if chatbots improve customer service efficiency by 1%–2% for a major enterprise, it will result in huge savings. Chatbots can troubleshoot simple issues or prescreen users before passing them off to a live agent. Richard Socher, chief scientist at Salesforce, <a href="https://www.technologyreview.com/s/603062/ai-winter-isnt-coming/">said</a>:</p>
<blockquote>
<p>“If we were to make the 150,000 companies that use Salesforce 1 percent more efficient through machine learning, you would literally see that in the GDP of the United States,” he says.</p>
</blockquote>
<p>Hype wasn’t the only reason for the last two AI winters, although it did play a big part. It initially fueled interest but failed to live up to expectations and didn’t provide real value to corporations and governments. While the hype has outpaced the reality once again, it is different this time. Aside from obvious beneficiaries like Google, Amazon, Microsoft and Tesla, who are sitting on mountains of data and resources, <strong>medium and small sized companies are applying AI in direct actions to grow revenue and see positive ROI</strong>. In the mobile gaming domain, publishers and studios are experimenting with <a href="http://www.scientificrevenue.com/">dynamic pricing</a> for in-app purchases. For a large gaming company, even a slight improvement over traditional models such as segmented or A/B pricing boosts revenue. Likewise, mobile operators are using <a href="https://codeahoy.com/2017/02/19/cluster-analysis-using-k-means-explained/">machine learning to predict when prepaid subscribers</a> will next recharge and use this prediction to grant or deny loans.</p>
<p>Is another AI winter on the horizon? I don’t think so. In fact I believe the interest, investment and research will continue to grow. It is already providing immense utility to not just the big corporations or companies building self-driving cars, but to companies of all sizes and in different industries.</p>
<p>However, we must curb our expectations a little. AI won’t out-think humans anytime soon, understand deep meaning of music, eradicate cancer, or replace customer service people entirely, but it will instead provide real value in many domains and incremental improvements.</p>
<p><strong><a href="/2017/07/28/ai-is-not-magic-how-neural-networks-learn/">AI is not magic</a></strong> and the hype will die down, but the next AI winter will be more like <strong>California</strong> winter, not Canadian.</p>
<p>See you next time.</p>
Fix Employee Weaknesses or Focus on Their Strengths?2017-07-26T00:00:00+00:00https://codeahoy.com/2017/07/26/fix-employee-weaknesses-or-focus-on-their-strengths<p>In “<a href="https://www.amazon.com/First-Break-All-Rules-Differently/dp/1531865208">First, Break All the Rules: What the World’s Greatest Managers Do Differently</a>” the authors, Marcus Buckingham and Curt Coffman, have put together their observations from more than 80,000 Gallup interviews they conducted with various leaders and managers over a period of 25 years. The book is full of excellent insight into what <em>great managers do and don’t do</em> and debunks several traditional management myths. One such myth is that people are capable of almost anything if they work hard enough or <strong>everyone has unlimited potential</strong>. According to the authors, this is a complete fallacy and while it is an uplifting thought, it is far from reality.</p>
<!--more-->
<blockquote>
<p>“There once lived a scorpion and a frog.</p>
<p>The scorpion wanted to cross the pond, but, being a scorpion, he couldn’t swim. So he scuttled up to the frog and asked: “Please, Mr. Frog, can you carry me across the pond on your back?”</p>
<p>“I would,” replied the frog, “but, under the circumstances, I must refuse. You might sting me as I swim across.”</p>
<p>“But why would I do that?” asked the scorpion. “It is not in my interests to sting you, because you will die and then I will drown.”</p>
<p>Although the frog knew how lethal scorpions were, the logic proved quite persuasive. Perhaps, felt the frog, in this one instance the scorpion would keep his tail in check. So the frog agreed. The scorpion climbed onto his back, and together they set off across the pond. Just as they reached the middle of the pond, the scorpion twitched his tail and stung the frog. Mortally wounded, the frog cried out: “Why did you sting me? It is not in your interests to sting me, because now I will die and you will drown.”</p>
<p>“I know,” replied the scorpion as he sank into the pond. “<strong>But I am a scorpion. I have to sting you. It’s in my nature</strong>.”</p>
</blockquote>
<p>In this old parable, the frog made a fatal mistake in believing that the scorpion’s nature would change.</p>
<blockquote>
<p>Great managers reject this out of hand. They remember what the frog forgot: that each individual, like the scorpion, is true to his unique nature … They know that there is a <em>limit</em> to how much remolding they can do to someone. But they don’t bemoan these differences and try to grind them down. Instead they <em>capitalize</em> on them. They try to help each person become more and more of who he already is.</p>
</blockquote>
<p>In the same situation, different people react differently according to their nature. People are motivated differently. For example, I worked with a software developer who was very competitive by nature, and his productivity would go through the roof when he heard that someone else on the team did it better or faster. That was his trigger. If a task carries too much risk, it is best assigned to a person who is meticulous rather than to someone who is a risk taker.</p>
<p>Everyone has some talents, which the authors define as ‘<em>recurring patterns that could be applied productively.</em>’ Willpower is a talent. So is empathy and competitiveness. The key is to select and hire for talent and <strong>cast in the right role</strong>. This includes identifying an individual’s talents and assigning responsibilities which maximize the strengths and neutralize weaknesses.</p>
<blockquote>
<p>Casting for talent is one of the unwritten secrets to the success of great managers. On occasion it can be as simple as knowing that your aggressive, ego-driven salesperson should take on the territory that requires a fire to be lit beneath it. And, by contrast, your patient, relationship-building salesperson should be offered the territory that requires careful nurturing.</p>
</blockquote>
<p>This may sound like common knowledge but all too often <strong>hiring managers put excessive emphasis on skills and experience over talent</strong>. Skill or how-to’s of a role can be taught. Talent cannot be taught. A Java software developer can learn Python, but may not become a good marketer. An aggressive, ego-driven person generally makes a poor team player but put that person in a situation that requires a fire to be lit under it, and that person might just become a rockstar.</p>
<blockquote>
<p>People don’t change that much.
Don’t waste time trying to put in what was left out.
Try to draw out what was left in.
That is hard enough.</p>
</blockquote>
<p>This is the essence of the ‘<a href="http://www.strengthsfinder.com/home.aspx">focus on strengths</a>’ school of thought. There is <em>some</em> <a href="http://www.gallup.com/businessjournal/442/four-disciplines-sustainable-growth.aspx">scientific evidence</a> to support this theory:</p>
<blockquote>
<p>Beyond a person’s mid-teens, that unique network of synaptic connections, in which some are strong and robust and others non-existent, does not change significantly. This means that a person’s recurring patterns of thought, of feeling and of behavior do not change significantly. If he is empathic when he is hired, he will stay empathic. If he is impatient for action when he is hired, he will stay impatient.</p>
</blockquote>
<p>There is also criticism of the ‘focus on strengths’ based approach. Dr. Tomas Chamorro-Premuzic suggests that focusing too much on our strengths <a href="https://hbr.org/ideacast/2016/01/stop-focusing-on-your-strengths.html">can be counterproductive</a>:</p>
<blockquote>
<p>it’s important to understand that even the smartest, brightest, and most brilliant individuals have a dark side. They have certain elements of their personality, of their typical behaviors, that are quite counterproductive. And if those tendencies are left unchecked, no matter how smart, competent, and talented they are, their careers are at risk of derailing.</p>
<p><strong>Think of an employee or an individual who is very driven and ambitious. If we developed their ambition and drive even further, they might just become greedy</strong>. Or somebody who is very socially skilled, if they develop their social skills even further, they might become almost Machiavellian and manipulative. People who are very creative can become odd and eccentric, and people who are already a little bit confident, if we make them even more confident, they might become arrogant or overconfident.</p>
</blockquote>
<p>I generally agree with the idea of focusing on strengths, as too many managers focus on <em>irrelevant</em> weaknesses or non-talents of their reports. There isn’t enough time to change an employee’s nature even a little or to give birth to a new talent. Does this mean we should completely ignore weaknesses? If a weakness is relevant and it is affecting performance, the manager must determine whether the weakness is trainable (i.e. a missing skill), whether the person is cast in the wrong role, or whether the person can be paired up with someone who has complementary strengths. Either way, poor performance should be tackled head on as soon as possible.</p>
<p>Until next time.</p>
Tweaking TCP for Real-time Applications: Nagle's Algorithm and Delayed Acknowledgment2017-03-19T00:00:00+00:00https://codeahoy.com/2017/03/19/tweaking-tcp-for-real-time-applications-nagle-algorithm-and-delayed-acknowledgment<p><a href="https://en.wikipedia.org/wiki/Transmission_Control_Protocol">TCP</a> is a <em>complex</em> protocol.</p>
<p>Don’t get me wrong. It is a marvelous piece of engineering that gives us the reliable data transmission guarantee that other protocols don’t provide. <em>Reliable</em> data transmission between two devices on the internet is no walk in the park and TCP uses a <a href="https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Flow_control">lot</a> <a href="https://en.wikipedia.org/wiki/TCP_congestion_control">of</a> <a href="https://en.wikipedia.org/wiki/TCP_window_scale_option">magic</a> under the hood to make things happen. Generally, it does a fine job of <a href="https://codeahoy.com/2016/05/06/good-abstractions-have-fewer-leaks/">abstracting away low level details</a> and its default settings work fine for most general purpose use cases. However, once in a while, things don’t go according to plan and we need to pop open the hood and do some tweaking. It is in these situations, that some knowledge of TCP comes in very handy.</p>
<p><!--more--></p>
<p>By default, <strong>TCP uses two buffering techniques</strong> to optimize and minimize overhead for general purpose applications. However, if you are building applications that require real-time message delivery for small messages (e.g. chat or control messages), you must have some knowledge of these techniques.</p>
<ol>
<li>Nagle’s algorithm</li>
<li>TCP delayed acknowledgment</li>
</ol>
<p>Let’s look at them in more detail.</p>
<h2 id="nagles-algorithm">Nagle’s Algorithm</h2>
<p>If there’s no congestion, TCP tacks on a header and sends data out as soon as it gets it from the application. If the application is generating a lot of small messages, the headers can add a lot of <strong>overhead</strong>: TCP/IP headers are 40 bytes, so 1 byte of data is sent as a 41-byte packet on the network. A computer programmer named John Nagle came up with an algorithm to reduce the overhead by combining many small messages into a single message. Nagle’s algorithm, named after its inventor, <strong>is a technique to make TCP more efficient by reducing the number of packets that are sent over the network</strong>. Here’s the <a href="https://en.wikipedia.org/wiki/Nagle%27s_algorithm">pseudocode</a> for the algorithm:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if there is new data to send
if the window size >= MaximumSegmentSize and available data is >= MaximumSegmentSize
send complete MaximumSegmentSize segment now
else
if there is unconfirmed data still in the pipe
enqueue data in the buffer until an acknowledge is received
else
send data immediately
end if
end if
end if
</code></pre></div></div>
<p>So what the algorithm is saying is that if the data to be sent is <em>smaller</em> than the <a href="https://en.wikipedia.org/wiki/Maximum_segment_size">maximum segment size (MSS)</a> (~ 1.4 KB), it is sent immediately <em>ONLY</em> if TCP has received acknowledgment for all the data that was previously sent. Another way to see it: if the newly generated data on the sender is small and its TCP is waiting for the receiver to acknowledge receipt of data that is in flight, Nagle’s algorithm will tell TCP to buffer the data and it won’t be sent immediately.</p>
<p>Nagle’s algorithm works great for most TCP applications like video streaming that produce data at a very high rate which exceeds the MSS very quickly and has to be sent out. But for multiplayer gaming servers, it creates performance issues. Tens of thousands of clients connect to a front-end server which sits between clients and backend gaming services. Even though the overall throughput is very high, only a small amount of data (a chat message roughly 100 bytes) is available for each client/socket and hence the data is buffered using Nagle’s algorithm.</p>
<p>So if you are building <strong>TCP applications that expect responses to arrive in real-time, Nagle’s algorithm will result in poor performance and higher latency</strong>. Nagle’s algorithm is turned on by default at the system level, but you can disable it for your application using the <a href="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_for_real_time/7/html/tuning_guide/tcp_nodelay_and_small_buffer_writes">TCP_NODELAY</a> socket option.</p>
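<p>As a quick sketch of what that looks like in practice (a minimal Python example, not tied to any particular application in this post), disabling Nagle’s algorithm is a one-line socket option:</p>

```python
import socket

# Create a TCP socket and disable Nagle's algorithm with TCP_NODELAY,
# so small writes go out immediately instead of being buffered while
# waiting for ACKs of in-flight data.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Read the option back: a non-zero value means Nagle is disabled.
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print("Nagle disabled:", bool(nodelay))
sock.close()
```

<p>The equivalent option exists in most languages and frameworks (for example, <code>setTcpNoDelay</code> in Java and <code>setNoDelay</code> in Node.js); set it before latency-sensitive traffic starts flowing.</p>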
<p>However, Nagle’s algorithm is not the only culprit when it comes to higher message latencies for real-time application. As we’re about to see, another type of buffering on the receiver side, prevents acknowledgments for received data to be sent out immediately.</p>
<h2 id="tcp-delayed-acknowledgment">TCP Delayed Acknowledgment</h2>
<p>TCP delayed acknowledgment is an optimization technique to combine multiple acknowledgments (ACKs) into a single response to reduce overhead. Upon receiving data, receivers (using delayed acknowledgment) don’t send an acknowledgment right away but instead wait for a few hundred milliseconds (200ms to 500ms) so it can be sent bundled together with any other acknowledgments or data that they might generate during that window.</p>
<p><strong>Nagle’s algorithm (sender side) and TCP delayed acknowledgment (receiver side) are a double whammy for real-time applications</strong>: receivers wait a few hundred milliseconds before acknowledging senders. Without receiving an acknowledgment, senders keep on buffering small packets until they receive an acknowledgment or the buffered data exceeds the maximum segment size, which can take a long time (&gt; 200ms) if the data is small. This <em>double buffering</em> wreaks havoc on real-time applications and increases message latency. Nagle’s algorithm would probably perform better without TCP delayed acknowledgment, which can be disabled using the <a href="https://linux.die.net/man/7/tcp">TCP_QUICKACK</a> socket option. However, it’s not always possible to control the behavior of client devices. Besides, disabling Nagle’s algorithm on the server for real-time applications does the job.</p>
<p>Until next time!</p>
Cluster Analysis Using K-means Explained2017-02-19T00:00:00+00:00https://codeahoy.com/2017/02/19/cluster-analysis-using-k-means-explained<p>Clustering or <a href="https://en.wikipedia.org/wiki/Cluster_analysis">cluster analysis</a> is the process of <strong>dividing data into groups (clusters) in such a way that objects in the same cluster are more similar to each other than those in other clusters</strong>. It is used in data mining, machine learning, pattern recognition, data compression and in many other fields. In machine learning, it is often a starting point. In a machine learning application I built a couple of years ago, we used clustering to divide six million prepaid subscribers into five clusters and then built a model for each cluster using <a href="https://en.wikipedia.org/wiki/Linear_regression">linear regression</a>. The goal of the application was to predict future recharges by subscribers so operators can make intelligent decisions like whether to grant or deny emergency credit. Another (trivial) application of clustering is dividing customers into groups based on spending habits or brand loyalty for further analysis or to determine the best promotional strategy.</p>
<!--more-->
<p>There are various models and techniques for cluster analysis. When I first started, I was <em>mistakenly searching for ‘the best clustering model or technique.’</em> I wasn’t aware that there is no universal best algorithm and the choice depends on your requirements and the dataset. There are <a href="https://en.wikipedia.org/wiki/Cluster_analysis#Density-based_clustering">density-based</a>, <a href="https://en.wikipedia.org/wiki/HCS_clustering_algorithm">graph based</a> or <a href="https://en.wikipedia.org/wiki/Cluster_analysis#Centroid-based_clustering">centroid based</a> clustering models. We finally settled on a clustering technique called k-means. This blog post is a brain-dump of everything I’ve learned about clustering and k-means so far.</p>
<h1 id="k-means">K-means</h1>
<p>K-means is a very simple and widely used clustering technique. It divides a dataset into ‘<em>k</em>’ clusters. The ‘<em>k</em>’ must be supplied by the user, hence the name k-means. It is general purpose and the <a href="http://johnloeber.com/docs/kmeans.html">algorithm is straightforward</a>:</p>
<blockquote>
<p>We call the process k-means clustering because we assume that there are <strong>k</strong> clusters, and each cluster is defined by its center point — its mean. To find these clusters, we use <strong>Lloyd’s Algorithm</strong>: we start out with <strong>k</strong> random centroids. A centroid is simply a datapoint around which we form a cluster. For each centroid, we find the datapoints that are closer to that centroid than to any other centroid. We call that set of datapoints its cluster. Then we take the mean of the cluster, and let that be the new centroid. We repeat this process (using the new centroids to form clusters, etc.) until the algorithm stops moving the centroids.[0] We do this in order to minimize the total sum of distances from every centroid to the points in its cluster — <em>that is our metric for how well the clusters split up the data</em>.</p>
</blockquote>
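<p>The procedure in the quote above can be sketched in a few dozen lines of plain Python. This is a toy illustration for 2-D points — the function names, the toy data, and the convergence check are mine, not from any library:</p>

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance (no sqrt needed for comparisons)."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def kmeans(points, k, iterations=100, seed=42):
    """Toy Lloyd's algorithm for 2-D points; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k random datapoints as starting centroids
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [min(range(k), key=lambda i: sq_dist(p, centroids[i]))
                  for p in points]
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i in range(k):
            cluster = [p for p, label in zip(points, labels) if label == i]
            if cluster:
                new_centroids.append((sum(p[0] for p in cluster) / len(cluster),
                                      sum(p[1] for p in cluster) / len(cluster)))
            else:  # empty cluster: keep the old centroid
                new_centroids.append(centroids[i])
        if new_centroids == centroids:  # centroids stopped moving: done
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated blobs around (0, 0) and (10, 10).
pts = [(0.1, 0.2), (0.3, -0.1), (-0.2, 0.0),
       (9.8, 10.1), (10.2, 9.9), (10.0, 10.3)]
centroids, labels = kmeans(pts, k=2)
print(sorted(labels))  # [0, 0, 0, 1, 1, 1] — three points in each cluster
```

<p>Real implementations (such as scikit-learn’s <code>KMeans</code>) add smarter initialization like k-means++ and run multiple restarts, since Lloyd’s algorithm only finds a local optimum.</p>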
<p>Here’s an animation (originally from simplystatistics.org, though the original post is no longer online) showing how it works.</p>
<p><img src="https://codeahoy.com/img/kmeans/kmeans-animated.gif" alt="kmeans-animated" /></p>
<p>As far as its performance is concerned, k-means and its <a href="https://upcommons.upc.edu/bitstream/handle/2117/23414/R13-8.pdf">variants can usually process large datasets very quickly</a>, as long as the number of clusters isn’t very high.</p>
<h2 id="finding-k-number-of-clusters-using-the-elbow-method">Finding ‘k’: number of clusters using the elbow method</h2>
<p>If you know the number of clusters beforehand, you have everything you need to apply k-means. However, in practice, it’s rare that the number of clusters in the dataset is known. When I’m modeling, I’m not sure if there are 3 clusters or 13 in the dataset. Luckily, there’s a technique called the ‘<a href="https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set">elbow method</a>’ that you can use to determine the number of clusters:</p>
<blockquote>
<p>One should choose a number of clusters so that adding another cluster doesn’t give much better modeling of the data. More precisely, if one plots the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters is chosen at this point, hence the “elbow criterion”.</p>
</blockquote>
<p><img src="https://codeahoy.com/img/kmeans/kmeans-elbow.JPG" alt="kmeans-elbow" /></p>
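<p>As a toy sketch of reading the elbow off such a curve — the within-cluster sum-of-squares (WCSS) numbers below are made up for illustration, and picking the elbow programmatically like this is a crude heuristic at best:</p>

```python
# Hypothetical WCSS values for k = 1..6 (made-up numbers): the curve drops
# steeply up to k = 3, then flattens out.
wcss = {1: 1200.0, 2: 700.0, 3: 250.0, 4: 210.0, 5: 185.0, 6: 170.0}

# Marginal gain (variance explained) from adding one more cluster.
gains = {k: wcss[k - 1] - wcss[k] for k in range(2, 7)}

# Crude elbow heuristic: the k after which the marginal gain collapses most.
elbow = max(range(2, 6), key=lambda k: gains[k] - gains[k + 1])
print(elbow)  # 3
```

<p>In practice you would usually just plot WCSS against k and eyeball the bend.</p>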
<p>Silhouette analysis is another popular technique for <a href="http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html">finding the optimum number of clusters visually</a>, and the one we actually used in our application.</p>
<h2 id="k-means-limitations-and-weaknesses">K-means limitations and weaknesses</h2>
<p>Unfortunately, k-means has limitations and doesn’t work well with different types of clusters. It doesn’t do well when:</p>
<ol>
<li>the clusters are of unequal size or density.</li>
<li>the clusters are non-spherical.</li>
<li>there are outliers in the data.</li>
</ol>
<p>I have read that the dataset must have well-separated clusters in order for k-means to work properly. However, in practice, I have experienced k-means doing a pretty good job of dividing data up into clusters, even when the clusters are not well separated or obvious. I will keep it off the list.</p>
<h3 id="1-the-clusters-are-of-unequal-size-or-density">1. The clusters are of unequal size or density</h3>
<p>K-means won’t identify clusters properly if clusters have an uneven size or density. To illustrate the point, I generated a 2d dataset.</p>
<p><img src="https://codeahoy.com/img/kmeans/uneven-unlabelled.png" alt="uneven-unlabelled" /></p>
<p><strong>I see 3 clusters</strong> of uneven densities. Do you?</p>
<p><img src="https://codeahoy.com/img/kmeans/uneven-unlabelled-manual.png" alt="uneven-unlabelled-manual" /></p>
<p>Let’s see how k-means does. I wrote a simple Python script with scikit-learn and ran k-means with <em>k=3</em>.</p>
<p><img src="https://codeahoy.com/img/kmeans/uneven-applied.png" alt="uneven-applied" /></p>
<p>As expected, k-means messed up and couldn’t find the naturally occurring clusters that we recognized in the dataset. It identified the larger cluster correctly but mixed up the smaller ones. If you have datasets where naturally occurring clusters have unequal densities, try <a href="https://en.wikipedia.org/wiki/Cluster_analysis#Density-based_clustering">density based models</a>.</p>
<h3 id="2-the-clusters-are-non-spherical">2. The clusters are non-spherical</h3>
<p>Let’s generate a 2d dataset with non-spherical clusters.</p>
<p><img src="https://codeahoy.com/img/kmeans/kmeans-spherical-unlabelled.png" alt="kmeans-spherical-unlabelled" /></p>
<p>It depends on how you look at it, but <strong>I see 2 clusters</strong> in the dataset. Let’s run k-means and see how it performs.</p>
<p><img src="https://codeahoy.com/img/kmeans/kmeans-spherical-applied.png" alt="kmeans-spherical-applied" /></p>
<p>Looking at the result, it’s obvious that k-means couldn’t correctly identify the clusters. If you have a similar dataset, try a hierarchical or density based algorithm like <a href="https://en.wikipedia.org/wiki/Spectral_clustering">spectral clustering</a> or <a href="https://en.wikipedia.org/wiki/DBSCAN">DBSCAN</a> that are better suited.</p>
<h3 id="3-there-are-outliers-in-the-data">3. There are outliers in the data.</h3>
<p>If outliers are present in the dataset, they can influence clustering results and change the outcome. The dataset should be pre-processed before applying k-means to detect and remove any outliers. There are many techniques for <a href="https://web.archive.org/web/20191126130103/http://www.pmg.it.usyd.edu.au/outliers.pdf">outlier detection</a> <a href="https://pdfs.semanticscholar.org/49f3/d110f87ae245127d2e30049628785e95d23e.pdf">and</a> <a href="http://stackoverflow.com/questions/13989419/removing-outliers-from-a-k-mean-cluster">removal</a>. Unfortunately, discussing these techniques is beyond the scope of this post and my expertise.</p>
<h2 id="evaluating-k-means-results">Evaluating k-means results</h2>
<p>Let’s assume you have just applied k-means to a dataset. <strong>How do you tell if k-means did its job right and identified all clusters correctly?</strong> <em>If</em> your dataset is 2-dimensional and isn’t very large, you could probably plot the results and inspect them visually to assess k-means, just like how I’ve been showing results to you so far. But what if your dataset is high-dimensional, as is often the case? We are trapped and can’t visualize data beyond 3 dimensions. So we need some way to evaluate the results of k-means, and there are a couple that I know of:</p>
<ol>
<li>Supervised evaluation</li>
<li>Unsupervised evaluation</li>
</ol>
<h3 id="1-supervised-evaluation">1. Supervised evaluation</h3>
<p>Supervised evaluation can be used to check the results of k-means when you have a pre-classified dataset available that can act as a benchmark. Typically, an expert would carefully process a small dataset assigning datapoints to clusters manually. This is the <strong>gold standard</strong> for evaluation and tells us how close the results are to the benchmark. Keep in mind that in order to use supervised evaluation, you’d need to be able to train k-means and then supply the pre-classified benchmarks to check expectations against actual results (not possible if you are using a standalone app like Weka or Tableau). This <a href="http://scikit-learn.org/stable/modules/clustering.html#homogeneity-completeness-and-v-measure">page describes how to use supervised evaluation</a> in scikit-learn, a popular python machine learning library.</p>
<h3 id="2-unsupervised-evaluation">2. Unsupervised evaluation</h3>
<p>Unsupervised evaluation doesn’t rely on external information. In our app, we used intra-cluster (<em>within</em> cluster sum of squares) and inter-cluster (<em>between</em> cluster sum of squares) variances to decide if the results were good enough. These are just fancy terms to describe <strong>cohesion</strong> of points that are within a cluster and <strong>separation</strong> between clusters. <strong>We want good cohesive clusters that are well separated from others</strong>.</p>
<p><img src="https://codeahoy.com/img/kmeans/cohesion-separartion.jpg" alt="cohesion-separartion" /></p>
<p>For more information, this <a href="https://en.wikipedia.org/wiki/Cluster_analysis#Evaluation_and_assessment">Wikipedia article</a> describes several evaluation algorithms.</p>
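<p>Both measures are easy to compute by hand. Here’s a minimal sketch in plain Python — the helper names and the toy data are mine, not from our application:</p>

```python
def centroid(points):
    """Mean point of a list of equal-dimension tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def within_ss(clusters):
    """Within-cluster sum of squares: cohesion (lower is better)."""
    return sum(sq_dist(p, centroid(c)) for c in clusters for p in c)

def between_ss(clusters):
    """Between-cluster sum of squares: separation (higher is better)."""
    overall = centroid([p for c in clusters for p in c])
    return sum(len(c) * sq_dist(centroid(c), overall) for c in clusters)

# Two tight, well-separated clusters: cohesion is high (small within-SS)
# and separation is high (large between-SS).
clusters = [[(0.0, 0.0), (0.1, 0.1)], [(5.0, 5.0), (5.1, 5.1)]]
print(within_ss(clusters))   # ~0.02
print(between_ss(clusters))  # ~50.0
```

<p>A good clustering drives the within-cluster number down while keeping the between-cluster number large.</p>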
<p>That’s everything I have learned about k-means and clustering in general. If you have any comments or want to suggest improvements to this post, please leave them in the comments section below.</p>
Certificate Authorities - Do You Know Who You Trust?2017-02-18T00:00:00+00:00https://codeahoy.com/2017/02/18/certificate-authorities-do-you-know-who-you-trust<p>HTTPS (aka <em>HTTP over the secure <a href="https://en.wikipedia.org/wiki/Transport_Layer_Security">TLS</a> protocol</em>) provides a secure communication channel between web browsers and servers to guard against <a href="https://owasp.org/www-community/attacks/Manipulator-in-the-middle_attack">man-in-the-middle attacks</a>. Although researchers have identified and reported a <a href="https://drownattack.com/">few</a> <a href="http://heartbleed.com/">vulnerabilities</a>, TLS is still the best option out there and all <a href="https://codeahoy.com/2017/01/18/if-your-site-isnt-using-https-you-are-doing-it-wrong/">websites should be using it</a>.</p>
<!--more-->
<p>Arguably, the most famous TLS fiasco was not a vulnerability but an enormously miscalculated and incompetent attempt to increase ad revenues by a top-selling laptop manufacturer. The “<a href="http://www.slate.com/articles/technology/bitwise/2015/02/lenovo_superfish_scandal_the_result_of_evil_or_incompetence.html">Lenovo incident / scandal</a>” got its start when Lenovo thought it would be a brilliant idea to pre-install adware (<a href="https://en.wikipedia.org/wiki/Superfish">Superfish</a>) on their laptops to inject ads on webpages, both encrypted and non-encrypted. To allow Superfish to view and alter encrypted traffic, they pre-loaded its <a href="https://en.wikipedia.org/wiki/Self-signed_certificate">self-signed</a>, <a href="https://en.wikipedia.org/wiki/Root_certificate">root certificate</a> on their laptops. By doing this, Lenovo let Superfish become a man-in-the-middle and allowed it to view and alter traffic without the user ever knowing. If that was the end of the story, it might not have been all that terrible. However, by installing the self-signed root certificate, <strong>which used the same private key on all laptops, Lenovo exposed sensitive and confidential communication of their users to attackers</strong> or eavesdroppers connected to the same WiFi. Communication including emails, bank transactions, messages, and passwords was all exposed. After <a href="https://forums.lenovo.com/t5/Security-Malware/Potentially-Unwanted-Program-Superfish-VisualDiscovery/td-p/1794457">users complained</a> and there was a public outcry, Lenovo finally acknowledged that it “<a href="http://www.pcworld.com/article/2886690/lenovo-cto-admits-company-messed-up-and-will-publish-superfish-removal-tool-on-friday.html">messed up</a>” and apologized to users for betraying their trust. It quickly dumped Superfish. Microsoft stepped in and provided an update to Windows Defender to remove Superfish.</p>
<p><img src="https://codeahoy.com/img/lenovo-super-fish.png" alt="lenovo-super-fish" /></p>
<p>As someone who purchased a <a href="http://www.techradar.com/reviews/pc-mac/laptops-portable-pcs/laptops-and-netbooks/lenovo-y50-1207894/review/2">Y50</a> around the same time (2014), I’m happy to report that Superfish is dead. (<em>So long, and no-thanks for the fish.</em>)</p>
<p>It is not always the incompetence of laptop vendors - user trust has been violated intentionally for <a href="https://thenextweb.com/insider/2015/04/02/google-to-drop-chinas-cnnic-root-certificate-authority-after-trust-breach/">malicious purposes</a> as well:</p>
<blockquote>
<p>a Chinese certificate authority issued valid security certificates for a number of domains, including Google’s, without their permission, which resulted in a major trust breach in the crypto chain.</p>
<p>CNNIC had delegated its authority to Egyptian intermediary MCS Holdings to issue the certificates in question and the company installed it in a man-in-the-middle proxy internally.</p>
</blockquote>
<p>If China feels far away, <a href="http://www.itpro.co.uk/security/25315/symantec-employees-fired-over-fake-security-certificates">Symantec fired its employees</a> after it discovered they had issued fake Google security certificates.</p>
<p>As a user, you can protect your communication and your privacy by <strong>never ignoring security warnings from your web browser</strong>. Web browsers, especially Chrome and Firefox, do a pretty good job of recognizing potential threats and warning users, so don’t skip through unless you are absolutely sure what you are doing. They also have a <strong>“<a href="https://www.chromium.org/Home/chromium-security/root-ca-policy#TOC-Removal-of-Trust">Removal of Trust</a>“</strong> policy where they would distrust a root certificate authority if it is compromised, even if it is trusted by the operating system.</p>
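<p>For application developers, one hedge against a rogue root certificate of the Superfish kind is <em>certificate pinning</em>: compare the certificate the server actually presents against a fingerprint you recorded ahead of time. A rogue certificate may satisfy the operating system’s trust store, but it cannot match the pin. A minimal sketch in Python (the certificate bytes below are placeholders, not real certificates):</p>

```python
import hashlib


def sha256_fingerprint(cert_der: bytes) -> str:
    """Hex-encoded SHA-256 fingerprint of a DER-encoded certificate."""
    return hashlib.sha256(cert_der).hexdigest()


def pin_matches(cert_der: bytes, pinned_fingerprint: str) -> bool:
    """True if the presented certificate matches the pinned fingerprint.

    A Superfish-style interceptor swaps in a different certificate, so
    its fingerprint will not match the pin even though the OS trusts it.
    """
    return sha256_fingerprint(cert_der) == pinned_fingerprint.lower()


# Offline example: pin the legitimate certificate, then compare.
good_cert = b"...DER bytes of the legitimate certificate..."
pinned = sha256_fingerprint(good_cert)
mitm_cert = b"...DER bytes of an interception certificate..."
print(pin_matches(good_cert, pinned))   # True
print(pin_matches(mitm_cert, pinned))   # False
```

<p>In practice you would fetch the live certificate with <code>ssl.get_server_certificate((host, 443))</code> (which returns PEM; convert with <code>ssl.PEM_cert_to_DER_cert</code>) and store the pinned fingerprint out of band. Pinning is a blunt tool, since legitimate certificate rotations break it too, but for a client talking to a server you control it is a reasonable defense-in-depth measure.</p>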
<p>See you next time!</p>
Testers Make Software Teams Highly Productive2017-02-17T00:00:00+00:00https://codeahoy.com/2017/02/17/testers-make-software-teams-highly-productive<p>To put it mildly, developers are not great at testing their own products. Bias, pride, wrong assumptions, lack of time, switching contexts, all play a role in making developers ineffective at testing their code. Most companies, especially startups, don’t fully understand the role of a tester. Very early on in my career, we made the <strong>mistake of hiring people to be testers who applied for a software developer position but weren’t good enough programmers</strong>. We paid for it in terms of software quality and an overworked team. It wasn’t until I worked with some great testers that I realized how effective and productive software teams become when they have great testers on board.</p>
<!--more-->
<p>The best testers or quality assurance engineers I’ve ever worked with weren’t developers. One was a DevOps guy and the other was a Network Engineer. They became testers coincidentally because they were very smart and tremendously effective at finding bugs, even when they tested the system as a black-box. They treated the whole exercise like puzzle solving. Here’s an example bug report from them:</p>
<blockquote>
<p>“Found an issue with the API to retrieve all widgets right after registering as a new user. Requests kept timing out. Upon digging further, I discovered that the system is throwing NullPointerExceptions. This happens because the <em>xyz</em> counter in the database wasn’t properly initialized in the previous step. Initialize the counter in the previous step and also catch all un-handled exceptions and return an error to the user.”</p>
</blockquote>
<p>Or this one.</p>
<blockquote>
<p>“Found an issue. The system throws exception when the user selects option 5. I checked the logs and found nothing. I ran wireshark to look at the request and response and found the issue to be caused by developers sending a unicode character in the request. Don’t send unicode characters and log error reasons which is field_10 in the <a href="/learn/xml/toc/">XML</a> payload.”</p>
</blockquote>
<p>These bug reports not only clearly identified the issue, but also the root cause and suggested a better course of action, thus refining the product and improving its quality.</p>
<p>Some organizations <strong>naively assume that automated tests written by developers can replace testers</strong>. Wrong, very wrong. I love automated tests. But a good tester finds deficiencies and suggests improvements that a developer or an automated suite may overlook:</p>
<blockquote>
<p><em>Tester</em>: When I pass a string instead of a number for the billing amount, the app doesn’t capture it. You need to check the type and ensure it’s a number.</p>
<p><em>Developer</em>: What do you mean? I wrote the test for it just last week and my end-to-end client did in fact receive the status code 4000 which means it’s an error.</p>
<p><em>Tester</em>: Yes, the error is returned. But the error comes from the database in the form of an exception when it tries to store a string where a number should go. This is inefficient because the database call is expensive.</p>
<p><em>Developer</em>: Yeah. That makes sense when I think about it. I will get it fixed.</p>
</blockquote>
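<p>The fix the tester is pushing for, namely validating input at the API boundary instead of letting the database throw, could be sketched like this (the function name and error messages are illustrative, not from any real codebase):</p>

```python
from decimal import Decimal, InvalidOperation


def parse_billing_amount(raw: str) -> Decimal:
    """Validate a billing amount before any database call.

    Rejecting bad input here avoids an expensive round trip to the
    database that would only fail with an exception anyway.
    """
    try:
        amount = Decimal(raw)
    except InvalidOperation:
        raise ValueError(f"billing amount must be a number, got {raw!r}")
    if amount < 0:
        raise ValueError("billing amount must be non-negative")
    return amount


print(parse_billing_amount("19.99"))  # 19.99
```

<p>The caller still returns the same error code to the client; the difference is that the rejection now happens in cheap application code rather than deep in the storage layer.</p>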
<p>Another great reason for having dedicated testers on the team is to provide <strong><a href="https://www.joelonsoftware.com/2010/01/26/why-testers/">positive reinforcement</a> and closure to developers</strong> who may otherwise be doubtful whether they are on the right track or not. Developers usually breathe a sigh of relief when their releases are certified by testers and are told that everything works as expected.</p>
<blockquote>
<p>A great tester gives programmers immediate feedback on what they did right and what they did wrong. Believe it or not, one of the most valuable features of a tester is providing positive reinforcement. There is no better way to <strong>improve a programmer’s morale, happiness, and subjective sense of well-being than <del>a La Marzocco Linea espresso machine</del> to have dedicated testers who get frequent releases from the developers, try them out, and give negative and positive feedback</strong>.</p>
</blockquote>
<p>Until next time.</p>
What Is Yak Shaving? Advice for Software Developers on Staying Focused2017-02-13T00:00:00+00:00https://codeahoy.com/2017/02/13/yak-shaving-the-less-you-do-the-better<p>Yak shaving is defined as:</p>
<blockquote>
<p>what you are doing when you’re doing some stupid, fiddly little task that bears no obvious relationship to what you’re supposed to be working on, but yet a chain of twelve causal relations links what you’re doing to the original meta-task.</p>
</blockquote>
<p>Fun-fact about the origin of the term ‘yak shaving’: The term was <a href="http://projects.csail.mit.edu/gsb/old-archive/gsb-archive/gsb2000-02-11.html">coined</a> at the MIT AI Lab in the 90s. Its scientists got inspiration from an episode of the <em>Ren and Stimpy</em> show called “Yak Shaving Day”.</p>
<p>A picture is worth a thousand words. A video, perhaps billions. Here’s a video of Malcolm originally wanting to change a lightbulb but ending up chasing down yaks.</p>
<div style="width:100%;height:0;padding-bottom:56%;position:relative;">
<iframe src="https://giphy.com/embed/llbEoVMhkLngWlzVVa" width="100%" height="100%" style="position:absolute" frameborder="0" class="giphy-embed" allowfullscreen=""></iframe>
</div>
<p>To understand it more clearly, suppose you are required to perform a task. We’ll call it task A. As you start working on task A, it leads you to another task, e.g. task B. Task B leads you to Task C, and so on. Before you know it, you are working on Task Z, completely distracted from your original goal of completing task A.</p>
<p>Here’s an example dialog in the world of software development.</p>
<blockquote>
<p>Manager: “Did you fix the issue where we had to update the column name in our code because someone changed it in the DB? “</p>
<p>Software Developer: “Ah, not yet. I’m still working on it.”</p>
<p>Manager: “What happened? It was a one-line change.”</p>
<p>Software Developer: “As I looked into the code, I realized we were using a really old version of Hibernate. I tried to upgrade it but there were some breaking changes in the new version. They recommended switching to the Repository pattern so I refactored a few classes but now the DB is throwing errors. I’m debugging.”</p>
</blockquote>
<!--more-->
<p>In this fictional scenario, the developer ‘went down a rabbit hole,’ which had nothing to do with the original task of changing the column name in code. The key is staying focused and not letting distractions pull you away from the main goal. Even assuming the developer was right that the library needed to be upgraded, he mixed two unrelated tasks together. Stay focused and finish the original task. Then re-evaluate and prioritize the second task of upgrading the Hibernate version, and go back to it.</p>
<p>Moral of the story: Stay focused and stop chasing those yaks!</p>
<p>One thing I have always loved about startups is their ability to stay focused. Solo or even small teams of software developers at a startup can get things done much faster compared to their counterparts in large corporations because they have fewer yaks to shave. At a startup, you won’t hear: “I didn’t get any time to write code for the new feature because as I was creating the JIRA story, I realized that the epic needs to be split. I did that but then I had to move tasks and link them to the right macro on Confluence.”</p>
<!-- ![yak shaving gif](https://codeahoy.com/img/replacing-a-lightbulb-imgur.gif) -->
<p>I’m not suggesting that yak shaving is evil: sometimes, you have no choice but to go on side quests before you can reach your final destination. In fact, in larger companies, yak shaving is inevitable. Developers spend the vast majority of their time shaving yaks.</p>
<p>Ben Ramsey said it better: yak shaving “<a href="https://benramsey.com/blog/2015/11/yak-shaving/">isn’t just part of our jobs, it’s the entire job description.</a>” So the minute you start going down the yak shaving path, stop and ask yourself if shaving the yak is really necessary. <strong>The fewer yaks you have to shave, the faster you’ll get to your destination</strong>.</p>
<p>You might also want to familiarize yourself with <strong><a href="/2017/08/19/yagni-cargo-cult-and-overengineering-the-planes-wont-land-just-because-you-built-a-runway-in-your-backyard/">YAGNI, which stands for “You Aren’t Gonna Need It”</a></strong>. See you next time.</p>
Committing Teamicide by Micromanagement2017-02-08T00:00:00+00:00https://codeahoy.com/2017/02/08/committing-teamicide-by-micromanagement<h2 id="what-is-micromanagement">What is micromanagement?</h2>
<p>Micromanagement <a href="https://en.wikipedia.org/wiki/Micromanagement">is a</a> “<em>management style whereby a manager closely observes or controls the work of subordinates or employees</em>”.</p>
<p><img src="https://codeahoy.com/img/micromanagement-rsz.jpg" alt="micromanager picture" /></p>
<p>Micromanagement is bad. It hurts morale and works against making individuals or teams productive. An effective manager or leader makes people use their brains instead of acting like mindless zombies who require constant babysitting and instructions. A micromanager is like a helicopter parent, closely watching and monitoring employees, which is very demoralizing, especially to smart people.</p>
<!--more-->
<p>While some managers micromanage out of fear or job insecurity, most do it because <strong>they don’t trust</strong> their direct reports to carry out tasks as well as they would themselves, or because they fear that without their supervision, mistakes will be made. This lack of trust makes the manager defensive: they assume that the only way employees will do good work is if they are constantly monitored and reviewed, and that employees cannot be trusted to make decisions on their own. Employees reporting to a micromanager become cynical and sometimes even despise the manager. They start acting in their own self-interest, which is counter-productive and inhibits team formation.</p>
<p>An acquaintance once asked for my help on a project that was stuck. The technical manager who was in charge of a small team (less than 10 people) was drowning in work. I quickly realized that I was dealing with a micromanager. Employees were disenchanted and had very low morale or motivation to do a good job. They didn’t care. The micromanager, through his actions, made it clear to the team that he had no trust in them. He had put elaborate processes in place so that nothing could get marked as “done” until he had reviewed it down to the tiniest details. He told me that the team sucks and he was brought in to basically kick ass and get the project done. It was amusing because his lack of trust had become a self-fulfilling prophecy: he was frustrated that people were producing such low quality work, that he had to redo everything, and that he was swamped with menial coding tasks so much that he didn’t have time for anything else.</p>
<p><strong>Chapter 20 of <a href="https://www.amazon.com/Peopleware-Productive-Projects-Teams-Second/dp/0932633439">Peopleware</a> is called Teamicide</strong> and describes a list of things to not do if you are trying to grow productive teams. Micromanagement is at the top of that list because <strong>if there’s one thing you cannot protect yourself against, it’s your own people’s incompetence</strong>:</p>
<blockquote>
<p>It makes good sense for you the manager to take a defensive posture in most areas of risk. If you must work with a piece of failure prone gear, you get a backup; […]</p>
<p>There’s one area, though, where defensiveness will always backfire: <strong>You can’t protect yourself against your own people’s incompetence. If your staff isn’t up to the job at hand, you will fail. Of course, if the people are badly suited to the job, you should get new people</strong>. But once you’ve decided to go with a given group, your best tactic is to trust them. Any defensive measure taken to guarantee success in spite of them will only make things worse. It may give you some relief from worry in the short term, but it won’t help in the long run, and it will poison any chance for the team to jell.</p>
</blockquote>
<p>This is why companies should choose managers or leaders very wisely. Good leaders hire very, very carefully for both brains and cultural fit. They coach employees about their values and set goals and expectations clearly instead of holding hands all along the way. They show their employees that they are trusted to get the job done right and <a href="https://codeahoy.com/2016/04/12/let-them-own-it/">let them own it</a>. Good leaders don’t treat <a href="https://codeahoy.com/2016/04/14/mistakes-at-work-are-not-sins/">mistakes at work like sins</a>. They allow their employees to learn from their mistakes and develop skills and build knowledge. Trust is the key to growing productive teams.</p>
<p>There is, however, another problem on the other end of the spectrum: some managers fall into the opposite trap where they think that they can hire smart people who get things done and just disappear, leaving people on their own with vague goals and expectations. This style is called <strong>m<em>a</em>cromanagement, and it is just as bad as m<em>i</em>cromanagement</strong>. You are not micromanaging if you provide direction, occasionally override a decision, or help the team reach consensus. It requires the right balance.</p>
<p>In the end, it’s all about trust. Trust is a feeling, and it takes time to build. Trust builds and jells teams: it makes people feel safe, and when smart people feel safe and trusted, they do everything in their power to achieve the goal.</p>
htop Explained Visually2017-01-20T00:00:00+00:00https://codeahoy.com/2017/01/20/hhtop-explained-visually<p><a href="http://hisham.hm/htop/">htop</a> is an interactive process viewer and system monitor. It’s one of my favorite Linux tools that I use regularly to monitor system resources. If you take <a href="http://man7.org/linux/man-pages/man1/top.1.html">top</a> and put it on steroids, you get htop.</p>
<p><img src="https://codeahoy.com/img/htop-small.png" alt="htop" /></p>
<!--more-->
<p>htop has an awesome visual interface that you can also interact with using your keyboard. The screen packs a lot of information, which can be daunting to look at. I tried to find a nice infographic to explain what each number, value, or color-coded bar means, but couldn’t find any. Hence I decided to make one myself over the Christmas break.</p>
<p>When you first launch htop, you’ll be greeted with a colorful interface showing a list of all processes running on the system, normally sorted by CPU usage from highest to lowest. It also shows overall CPU usage and physical and swap memory usage.</p>
<p><img src="https://codeahoy.com/img/htop.jpg" alt="htop" /></p>
<p>There’s a lot of information in the screenshot. To explain, I’ve separated the interface into <em>upper</em> and <em>lower sections</em> so I have enough room to annotate. (If you want to know, I used Photoshop to annotate the screenshot.)</p>
<p>Let’s start with the <strong>upper section</strong>. To see it in <strong>higher resolution</strong>, click on the image.</p>
<p><a href="https://codeahoy.com/img/htop-top.png"><img src="https://codeahoy.com/img/htop-top.jpg" alt="htop" /></a></p>
<p>Here’s the <strong>lower section</strong> of htop.</p>
<p><a href="https://codeahoy.com/img/htop-bottom.png"><img src="https://codeahoy.com/img/htop-bottom.jpg" alt="htop" /></a></p>
<p>I hope you found this post useful. Here’s a <strong><a href="/compare/top-vs-htop.html">comparison between top and htop</a></strong>, comparing different features and properties.</p>
<p>If you are using macOS, please note that htop doesn’t come installed by default. You can install it easily using <code class="language-plaintext highlighter-rouge">brew</code>. Open the Terminal and type:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ brew install htop
</code></pre></div></div>
<p>After the installation is complete, you can launch it by typing <code class="language-plaintext highlighter-rouge">htop</code> on the Terminal:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo htop
</code></pre></div></div>
<p>Note: <code class="language-plaintext highlighter-rouge">sudo</code> is needed to give <code class="language-plaintext highlighter-rouge">htop</code> required access on macOS. On Linux, <code class="language-plaintext highlighter-rouge">sudo</code> isn’t required.</p>
<p>Until next time.</p>
Review of Andrew Ng's Machine Learning Course and Next Steps2017-01-19T00:00:00+00:00https://codeahoy.com/2017/01/19/online-course-review-the-best-machine-learning-course-for-beginners<p>Back when I was in college, I enrolled in a couple of introductory <a href="https://en.wikipedia.org/wiki/Artificial_intelligence">AI</a> courses. I quickly got bored: <a href="https://en.wikipedia.org/wiki/Artificial_neural_network">artificial neural networks</a> didn’t sound very practical and the dry mathematics was off-putting. After I finished the courses, I graduated and moved on. A while ago, I started noticing articles and blogs on self-driving cars that use “machine learning”. It sounded like a fancy new way to position the decades-old field of AI. I still wasn’t sure what the hype was all about. That was about to change.</p>
<!--more-->
<p>One evening by chance, I came across a link to what sounded like a <a href="https://news.ycombinator.com/item?id=9713802">Mario video</a>. Being a fan of retro video games, I opened it. The video shows a skilled player playing Super Mario World. Here’s the twist: <strong>the skilled player isn’t a human</strong>. It is a neural network program that taught itself how to play Super Mario World with zero help. When it started, it knew absolutely nothing about Super Mario World or Super Nintendo. It didn’t even know that pressing the A key on the controller makes Mario jump over obstacles. <strong>It learned to play and complete the first level all by itself… made possible by machine learning</strong>.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/qv6UVOQ0F44" frameborder="0" allowfullscreen=""></iframe>
<p>Needless to say, my mind was blown.</p>
<p>I wish my professors had shown something similar at the beginning of the course. I would have been all over it. It was a fantastic demonstration of the power of machine learning. I spent the weekend reading blogs and news articles about machine learning applications. I tried to run some simple applications but didn’t get far. I wanted to understand what problems it can solve. I wanted to apply it to a real-world problem.</p>
<p>At some point, I decided that I needed to take a course so I could read and understand machine learning blogs and research papers. I researched online and found <a href="https://www.coursera.org/learn/machine-learning">a course on Coursera</a> offered by <a href="https://twitter.com/AndrewYNg">Andrew Ng</a>. For those of you who don’t know who Andrew is, he is a highly respected and very influential scientist in the fields of machine learning and AI. He has led machine learning efforts at Google and Baidu. Before that, he taught at Stanford as an associate professor.</p>
<h2 id="my-review-of-the-machine-learning-course-by-andrew-ng">My Review of the Machine Learning Course by Andrew Ng</h2>
<p>I enrolled in <strong><a href="https://www.coursera.org/learn/machine-learning">Andrew’s course on Machine Learning</a></strong> and I’m super glad I did. If I have to rate Andrew’s course out of 5 stars, <strong>I would give it 6 stars</strong>.</p>
<p>It was literally one of <em>the</em> best learning experiences of my life. I had fun throughout and learned many useful concepts, many of which I was able to apply to solve real-world problems.</p>
<ul>
<li>The course is introductory level and is designed for <strong>complete beginners</strong> to machine learning. You don’t need any prior experience with machine learning tools and libraries.</li>
<li>The course is <strong>100% free</strong>. You’ll need to pay about $50 <em>if</em> you want the <em>course certificate</em> after completion.</li>
<li>The course itself is 11 weeks. I spent 3-4 hours a week. If you have more time, you can definitely finish it sooner.</li>
<li>Andrew Ng has an amazing teaching style. It’s super fun and very engaging. He clearly articulates complex algorithms and mathematical equations which make it very easy to grasp the subject matter.</li>
<li>The course will <strong>introduce you to various flavors of machine learning algorithms</strong>: <a href="https://en.wikipedia.org/wiki/Linear_regression">linear regression</a>, <a href="https://en.wikipedia.org/wiki/Logistic_regression">logistic regression</a>, <a href="https://en.wikipedia.org/wiki/K-means_clustering">k-means</a>, (artificial) <a href="https://en.wikipedia.org/wiki/Artificial_neural_network">neural networks</a>, <a href="https://en.wikipedia.org/wiki/Support_vector_machine">support vector machines</a>, <a href="https://en.wikipedia.org/wiki/Unsupervised_learning">unsupervised learning</a>. By covering many different algorithms, it lays the groundwork and sets up the foundation so you can continue learning in the areas that interest you.</li>
<li><strong>Programming assignments focus on solving real-world problems</strong>: handwritten digit recognition using neural networks and spam classification with support vector machines (SVM) were my favorites.</li>
<li>A couple of friends who took the course <strong>complained</strong> about one aspect: <em>the assignments must be done in MATLAB or Octave</em>. They were hoping they’d be able to use their favorite language and learn a machine learning library like <a href="https://www.tensorflow.org/">TensorFlow</a>. However, I feel MATLAB/Octave is a great choice for this course. It forces you to think about applying machine learning algorithms using matrices and matrix operations, without getting caught in the nuances of a high-level language or a library. I would, however, strongly recommend that you <strong>do not skip the tutorial sections covering MATLAB/Octave</strong> and pay very close attention to them. Review the tutorial sections twice if you need to, or you’ll spend a lot of time <em>stuck</em> on assignments.</li>
</ul>
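<p>To make that last point concrete: the batch gradient descent you implement in Octave translates directly to any language once you think in terms of whole-batch operations. Here is a rough single-variable sketch in plain Python (my own illustration, not from the course materials):</p>

```python
def gradient_descent(xs, ys, alpha=0.1, iters=1000):
    """Fit y ~ theta0 + theta1 * x by batch gradient descent."""
    theta0 = theta1 = 0.0
    m = len(xs)
    for _ in range(iters):
        # Compute predictions and errors for the whole batch at once,
        # mirroring the vectorized Octave formulation.
        errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


# Data generated from y = 2x + 1; the fit should recover those values.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
t0, t1 = gradient_descent(xs, ys)
print(round(t0, 2), round(t1, 2))  # 1.0 2.0
```

<p>In the course you would write the same two gradient lines as a single matrix expression, which is exactly the mental shift the MATLAB/Octave assignments are designed to teach.</p>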
<p>If you are interested in machine learning (you should) and you are a beginner or know very little about it, <a href="https://www.coursera.org/learn/machine-learning">Andrew’s course</a> is <strong>the best investment of your time that you can make</strong>. The only regret you’d have is that you didn’t enroll sooner :)</p>
<h2 id="what-to-do-next">What to do next?</h2>
<p>Deep Learning is one of the most sought after skills in the field of machine learning, and it is transforming many industries. Coursera and Andrew Ng offer some great courses on Deep Learning:</p>
<ol>
<li><a href="https://www.coursera.org/learn/neural-networks-deep-learning?specialization=deep-learning" rel="nofollow">Neural Networks and Deep Learning</a></li>
<li><a href="https://www.coursera.org/learn/deep-neural-network?specialization=deep-learning" rel="nofollow">Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization</a></li>
<li><a href="https://www.coursera.org/learn/machine-learning-projects?specialization=deep-learning" rel="nofollow">Structuring Machine Learning Projects</a></li>
<li><a href="https://www.coursera.org/learn/convolutional-neural-networks?specialization=deep-learning" rel="nofollow">Convolutional Neural Networks</a></li>
</ol>
<p>In addition, you should also join <a href="https://www.kaggle.com/" rel="nofollow">Kaggle</a>. It is an online community of data scientists and machine learning enthusiasts. It runs competitions that are slightly advanced for beginners but are a good way to explore the field. For many beginners, Kaggle’s best features are the no-setup, customizable Jupyter Notebook environments, access to free GPUs, and a huge repository of code published by other developers.</p>
<p>Happy machine learning.</p>
If Your Site Isn't Using HTTPS, You Are Doing It Wrong2017-01-18T00:00:00+00:00https://codeahoy.com/2017/01/18/if-your-site-isnt-using-https-you-are-doing-it-wrong<p>We live in a day and age where we simply cannot <em>take our right to privacy for granted</em>. When we communicate over <strong>unprotected channels, we expose our messages to everyone who happens to be along the way</strong>: The WiFi hotspots, corporate IT providers, ISPs, cloud providers, can listen in to our communication. We leave a trail of digital footprints behind. When aggregated, it can reveal information about ourselves. Eavesdroppers and intruders can make inferences about our behaviors and intentions: ISPs can determine what types of news stories we are interested in, employers can monitor our activities even on personal devices at work, look at our searches, see our messages, all when we communicate over unprotected channels.</p>
<p><img src="https://codeahoy.com/img/privacy.jpg" alt="Privacy" /></p>
<!--more-->
<p>Google has been <a href="https://developers.google.com/web/fundamentals/security/encrypt-in-transit/why-https">urging site owners to switch to HTTPS</a> for many years now. They started using HTTPS as a <a href="https://webmasters.googleblog.com/2014/08/https-as-ranking-signal.html">ranking indicator</a> for their search results. To tighten the screws, Chrome, starting with the upcoming version 56, will start <strong>showing “not secure” alerts on sites that collect login or credit card information over HTTP</strong>. Firefox will also <a href="https://blog.mozilla.org/security/2017/01/20/communicating-the-dangers-of-non-secure-http/">start displaying a red icon in the address bar as well as an in-context warning</a> for pages that ask users to login over HTTP.</p>
<p><img src="https://codeahoy.com/img/firefoxwarning.png" alt="Firefox" /></p>
<p>While I don’t know the real reasons that compel Google to drive the HTTPS campaign, it’s a great direction for the future of the web, a direction that we should all support.</p>
<p>So how do you make your website secure? It’s way easier to secure sites with HTTPS these days than it used to be. In the past, obtaining the digital certificate required for HTTPS involved paperwork and hundreds of dollars. This is no longer the case. <strong><a href="https://letsencrypt.org/">Let’s Encrypt</a> is a certificate authority that provides FREE certificates to anyone</strong>. It’s backed by organizations such as Mozilla, Facebook, and Google, to name a few. Let’s Encrypt makes it possible for anyone to have an HTTPS website for free. As an alternative, if you host your servers on AWS, <a href="https://aws.amazon.com/blogs/aws/new-aws-certificate-manager-deploy-ssltls-based-apps-on-aws/">Certificate Manager</a> provides free certificates and handles certificate renewals as an added bonus.</p>
<p>Getting an HTTPS-enabled website is easier (and cheaper) now than ever. If you are concerned that HTTPS slows things down, <a href="https://istlsfastyet.com/">think again</a>. <strong>HTTPS can even be <a href="https://www.troyhunt.com/i-wanna-go-fast-https-massive-speed-advantage/">faster</a> than HTTP</strong>.</p>
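<p>The same advice applies to code that talks to your site. In Python’s standard library, for instance, <code>ssl.create_default_context()</code> already verifies certificates and hostnames out of the box; a small sketch that additionally refuses anything older than TLS 1.2:</p>

```python
import ssl

# The default context enables certificate verification and hostname
# checking; we only tighten the minimum protocol version on top.
ctx = ssl.create_default_context()
ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # refuse TLS 1.0 / 1.1

print(ctx.verify_mode == ssl.CERT_REQUIRED)  # True
print(ctx.check_hostname)                    # True
```

<p>You would then pass <code>ctx</code> to, for example, <code>urllib.request.urlopen(url, context=ctx)</code>, so your client refuses to silently fall back to weak or unverified connections.</p>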
<p><img src="https://codeahoy.com/img/httpvshttps.jpg" alt="HTTP vs HTTPS" /></p>
<p>I intentionally didn’t touch on the security threats to unencrypted traffic, since they are well known to most people. The privacy aspects, unfortunately, aren’t as well known.</p>
<p>I would also like to make a quick announcement: <strong>starting today, all traffic to my blog is 100% fully encrypted and secured using HTTPS</strong>. I enabled HTTPS without spending a single penny and the whole process took less than an hour. If you have a website that isn’t using HTTPS yet, what are you waiting for?</p>
There's No Backdoor in WhatsApp. Just a Weakness That Could Be Exploited2017-01-17T00:00:00+00:00https://codeahoy.com/2017/01/17/theres-no-backdoor-in-whatsapp-just-a-weakness-that-could-be-exploited<p>Last week, the Guardian <a href="https://www.theguardian.com/technology/2017/jan/13/whatsapp-backdoor-allows-snooping-on-encrypted-messages">ran a story</a> claiming that a backdoor built into WhatsApp allows its parent company, Facebook, to read user messages despite advertising end-to-end encryption and complete privacy:</p>
<blockquote>
<p>Facebook claims that no one can intercept WhatsApp messages, not even the company and its staff, ensuring privacy for its billion-plus users. But new research shows that the <strong>company could in fact read messages due to the way WhatsApp has implemented its end-to-end encryption protocol</strong>.</p>
</blockquote>
<!--more-->
<p>Open Whisper Systems, the nonprofit behind <a href="https://en.wikipedia.org/wiki/Signal_Protocol">Signal protocol</a> that powers WhatsApp’s end-to-end message encryption, came to WhatsApp’s defense and <a href="https://whispersystems.org/blog/there-is-no-whatsapp-backdoor/">fired back</a> at the Guardian story:</p>
<blockquote>
<p>Today, the Guardian published a story falsely claiming that WhatsApp’s end to end encryption contains a “backdoor.” … The way this story has been reported has been disappointing. There are many quotes in the article, but it seems that the Guardian put very little effort into verifying the original technical claims they’ve made.</p>
</blockquote>
<p>So what really happened? <a href="https://twitter.com/tobiasboelter">Tobias Boelter</a>, who “<a href="https://www.theguardian.com/technology/2017/jan/16/whatsapp-vulnerability-facebook">discovered the vulnerability</a>”, followed up to further explain and support the backdoor/vulnerability theory and why it matters:</p>
<blockquote>
<p>… [WhatsApp] encrypted messaging works using secret and public keys. <strong>Every user has both a secret key known only to them, and a public key</strong>.</p>
<p>A user’s public key can be used to encrypt messages which can then only be made readable again with the associated secret key.</p>
</blockquote>
<p>Okay, so public key encrypts and private key decrypts. Public key is publicly available to anyone, while private key stays private. Let’s continue.</p>
<blockquote>
<p>A difficult problem in secure communication is getting your friend’s public keys. Apps such as WhatsApp and Signal make the process of getting those [public] keys easy for you by storing them on their central servers and allowing your app to download the public keys of your contacts automatically.</p>
<p>The problem here is that the <strong>WhatsApp server could potentially lie about the public keys</strong>. Instead of giving you your friend’s key, it could give you a public key belonging to a third party, such as the government.</p>
</blockquote>
<p>So a third party, impersonating your friend, can give you its own public key, and WhatsApp will overwrite your friend’s actual key with it, thinking that the key has changed. You will then encrypt messages using the wrong key, which allows the third party to read your messages. If the third party can substitute keys in both directions, it can act as the <strong><a href="https://owasp.org/www-community/attacks/Manipulator-in-the-middle_attack">man in the middle</a></strong> between you and your friend, eavesdropping on the conversation while staying under your radar. But in order to do any of this, <strong>it needs Facebook’s support</strong> and access to WhatsApp’s server infrastructure.</p>
<p>In reality, a <strong>user’s keys can change for any number of reasons</strong>. Wipe your device and reinstall the app, or get a new device, and you’ll get a new key. Open Whisper’s blog post suggests that the way WhatsApp handles key changes is appropriate:</p>
<blockquote>
<p>The only question it might be reasonable to ask is whether these safety number [keys] change notifications should be “blocking” or “non-blocking.” In other words, when a contact’s key changes, <strong>should WhatsApp require the user to manually verify the new key before continuing, or should WhatsApp display an advisory notification and continue without blocking the user</strong>.</p>
<p>[…] we feel that their choice to display a non-blocking notification is appropriate. It provides transparent and cryptographically guaranteed confidence in the privacy of a user’s communication, along with a simple user experience.</p>
</blockquote>
<p>What the Open Whisper blog post doesn’t mention is that the “<em>advisory notifications</em>” are <strong>optional</strong> and not <a href="https://www.whatsapp.com/faq/en/general/28030014">shown by default</a>. Users need to turn the “Show security notifications” option on explicitly in order to receive notifications that their friend’s key has changed.</p>
<p><img src="https://codeahoy.com/img/ws-showsecurity.png" alt="Show security notifications" /></p>
<p>WhatsApp has done the right thing considering its market and users. Most of WhatsApp’s users can’t be bothered to verify security codes, much less understand what they mean. End-to-end encryption was introduced <em>after</em> WhatsApp was already very successful and had a very large number of users. I doubt “blocking” or even “non-blocking” key change notifications that could potentially confuse users were even an option. They had to find the right balance between security and usability. WhatsApp is secure enough. But <strong>it does have a weakness, and it’s plausible that if Big Brother wants to tap into your WhatsApp messages or phone calls, it totally can… probably without you ever knowing</strong>.</p>
Leadership vs Management - Leaders Have a Dream, A Vision...2017-01-16T00:00:00+00:00https://codeahoy.com/2017/01/16/leadership-vs-management-leaders-have-a-dream<p>Today is Martin Luther King Jr. Day - a holiday to celebrate the life and legacy of a great <a href="https://en.wikipedia.org/wiki/African-American_Civil_Rights_Movement">Civil Rights Movement</a> leader. <a href="https://en.wikipedia.org/wiki/Martin_Luther_King_Jr.">Dr. King</a> was an incredibly effective leader who <em>challenged the status quo</em> and transformed American society forever. He had a <a href="http://www.americanrhetoric.com/speeches/mlkihaveadream.htm">dream</a> and got people to rally behind it to make it a reality. And that’s what great leaders do. <em>Dr. King, Gandhi, Henry Ford, Jeff Bezos,</em> all had a <strong>vision</strong> of a world that was very different from the one they lived in, and they were able to <strong>inspire people</strong> to work towards making their vision a reality.</p>
<p><img src="https://codeahoy.com/img/leadership/showing-the-way.jpg" alt="boss vs leader picture" /></p>
<!--more-->
<p>The question is this: in an organization, is there a difference between a manager and a leader, or are the two synonymous? Some companies regularly refer to their managers as leaders, but are these terms <em>always</em> interchangeable?</p>
<p>Let us look at differences between a leader and a manager.</p>
<p>Let’s start with managers. Managers are in charge of people, and their responsibility is to <strong>get people to achieve some goal or target effectively</strong>. They don’t define these goals or targets but rather derive them, directly or indirectly, from the strategy or direction the leadership has defined for the company. A bad manager can be very dangerous and can permanently damage team culture and morale. Managers have certain powers and influence that come attached to their position, and their subordinates have to dance to their tunes, at least to some degree. They are the ones who get to decide who gets the bonus or the promotion.</p>
<blockquote>
<p>In fact, <strong>management is a set of well-known processes, like planning, budgeting, structuring jobs, staffing jobs, measuring performance and problem-solving, which help an organization to predictably do what it knows how to do well</strong>. Management helps you to produce products and services as you have promised, of consistent quality, on budget, day after day, week after week. In organizations of any size and complexity, this is an enormously difficult task. <strong>We constantly underestimate how complex this task really is, especially if we are not in senior management jobs</strong>. So, management is crucial — but it’s not leadership.</p>
<p><strong>Leadership is entirely different. It is associated with taking an organization into the future, finding opportunities that are coming at it faster and faster and successfully exploiting those opportunities</strong>. Leadership is about vision, about people buying in, about empowerment and, most of all, about producing useful change.</p>
</blockquote>
<p>Managing people is tough. Imagine if Dr. King had to plan the logistics of his rallies, figure out the carpools or deal with time-off requests. To be effective, organizations need both leaders who are setting direction and looking into the future, and managers who are motivating the crew to keep the ship moving in that direction.</p>
<p>So while there is a difference between leadership and management, both can be done effectively or poorly. <a href="https://www.amazon.com/Levels-Leadership-Proven-Maximize-Potential/dp/1619692155">John C. Maxwell’s</a> hierarchy, or the “levels of leadership,” speaks, I believe, to both leaders and managers, since both are in charge of people or teams:</p>
<p><img src="https://codeahoy.com/img/leadership/Maxwell1.jpg" alt="Leadership Levels" /></p>
<p>Good leaders and managers have a few things in common. They deeply believe that people, and not some process, are their most important asset. They share a clear and compelling vision and have <strong>followers who enroll voluntarily</strong>, because they want to be a part of it. <strong>They are extremely good at identifying the right people</strong>, who don’t need to be babysat or micromanaged to get something done. You can’t go wrong if you hire the right people who have the talents you need, define clear outcomes, <em>trust</em> them with getting the task done, <em>empower</em> them and <a href="http://avc.com/2012/02/the-management-team-guest-post-from-joel-spolsky/">“get the hell out of their way”</a>:</p>
<blockquote>
<p>… there are a thousand leaders who learned to hire smart people and let them build great things in a nurturing environment of empowerment and it was AWESOME. That doesn’t mean lowering your standards. <strong>It doesn’t mean letting people do bad work. It means hiring smart people who get things done—and then getting the hell out of the way</strong>.</p>
</blockquote>
<p>That’s it. Hope you enjoyed this post. If you have any comments, please leave them in the comments section below.</p>
<p>Happy Martin Luther King Day.</p>
Tutorial - Configuring Photoshop for 2D Pixel Art2016-12-11T00:00:00+00:00https://codeahoy.com/2016/12/11/photoshop-pixel-art<p>I’m a huge fan of retro video games and <a href="http://imgur.com/a/d30KO">pixel art</a>. Over the Christmas break, I tried (after a long hiatus) to create some pixel art for a retro-style 2D mobile game I was building in Unity for fun. I had to struggle a little in setting up Photoshop to create 2D sprites and the background, so here’s a quick step-by-step tutorial on how to configure Photoshop to create pixel art.</p>
<!--more-->
<h2 id="step-1-create-a-tiny-image">Step 1: Create a Tiny Image</h2>
<p>Pixel art is done in very low resolutions. What this means is that you’ll start by creating a <strong>very small image, one that you can barely see without zooming in</strong>. I can’t give you a rule of thumb, but I generally use 20x20 pixels for sprites (sometimes 40x40 pixels if I want to put in more details) and about 150x80 pixels for backgrounds.</p>
<p>So go ahead and create a new image in Photoshop.</p>
<p><img src="https://codeahoy.com/img/ps-pixel/1.png" alt="photoshop-pixel-art" /></p>
<p>After you have created the image, you’ll barely be able to see it, so <strong>zoom in</strong> until you can.</p>
<p><img src="https://codeahoy.com/img/ps-pixel/1-zoom.png" alt="photoshop-pixel-art" /></p>
<h2 id="step-2-setup-image-interpolation-to-nearest-neighbors">Step 2: Setup Image Interpolation to Nearest Neighbors</h2>
<p>When your pixel art is resized or scaled, you’ll want the <strong>edges or corners to look hard and jagged instead of smooth and blurred</strong>. By default, Photoshop uses <em>Bicubic interpolation</em> (or Bilinear), which produces a blurred effect when images are enlarged. While Bicubic interpolation works great for normal images, pixel art scaled with it looks terrible and blurry as hell. As an example (<a href="http://blog.demofox.org/2015/08/15/resizing-images-with-bicubic-interpolation/">source</a>):</p>
<blockquote>
<p>Here’s the old man from The Legend of Zelda who gives you the sword. (<em>You may want to squint to see it</em>)</p>
</blockquote>
<p><img src="https://codeahoy.com/img/ps-pixel/LozMan.bmp" alt="old-man-from-zelda-original" /></p>
<p>Here he is scaled up <strong>4x with Bicubic interpolation</strong>:</p>
<p><img src="https://codeahoy.com/img/ps-pixel/lozman_4_2.bmp" alt="old-man-from-zelda-nn" /></p>
<p>And here he is scaled up 4x using <strong>Nearest Neighbor</strong>:</p>
<p><img src="https://codeahoy.com/img/ps-pixel/lozman_4_0.bmp" alt="old-man-from-zelda-nn" /></p>
<p>See the difference? Here’s how to configure Photoshop to use the ‘Nearest Neighbor’ image interpolation algorithm.</p>
<p><img src="https://codeahoy.com/img/ps-pixel/6.png" alt="photoshop-pixel-art" /></p>
<p>Note: If you are exporting the image (‘Save for Web’ option) and resizing it, make sure that ‘Nearest neighbor’ is selected under ‘Quality’ or ‘Resample’.</p>
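<p>Nearest Neighbor preserves those hard edges because it simply repeats source pixels rather than averaging them. Here’s a toy sketch of the idea in Python (illustrative only — Photoshop’s implementation is its own):</p>

```python
def scale_nearest(pixels, factor):
    """Upscale a 2D pixel grid by an integer factor by repeating pixels."""
    return [
        [row[x // factor] for x in range(len(row) * factor)]
        for row in pixels
        for _ in range(factor)  # repeat each row `factor` times
    ]

# A 2x2 checkerboard sprite, scaled 2x -- the edges stay perfectly hard:
sprite = [
    [0, 1],
    [1, 0],
]
for row in scale_nearest(sprite, 2):
    print(row)
# [0, 0, 1, 1]
# [0, 0, 1, 1]
# [1, 1, 0, 0]
# [1, 1, 0, 0]
```

<p>Bicubic, by contrast, computes each output pixel as a weighted blend of its neighbors, which is where the blur comes from.</p>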
<h2 id="step-3-set-up-the-tools">Step 3: Set up the Tools</h2>
<p>You’ll need to set up your drawing tools to get the desired pixelated effect. For the pencil and eraser tools, here are the settings I used:</p>
<ul>
<li>size to 1 pixel (px).</li>
<li>hardness to 100%.</li>
<li>opacity to 100%.</li>
<li>for the eraser tool, mode was set to ‘Pencil’.</li>
</ul>
<p><img src="https://codeahoy.com/img/ps-pixel/2.png" alt="photoshop-pixel-art" /></p>
<p>The only other tool I used was the paint bucket which didn’t require any customization.</p>
<h2 id="step-4-show-the-grid-optional">Step 4: Show the Grid (optional)</h2>
<p>The grid is helpful for positioning and aligning things precisely, and I find it very useful when creating sprites. It can be enabled from the ‘View’ menu.</p>
<p><img src="https://codeahoy.com/img/ps-pixel/3.png" alt="photoshop-pixel-art" /></p>
<p>Next, we’ll need to <strong>adjust the grid so it can display each pixel individually</strong>. Open “Guides, Grids and Slices” settings from the Preferences menu and update the grid settings.</p>
<p><img src="https://codeahoy.com/img/ps-pixel/4.png" alt="photoshop-pixel-art" /></p>
<p><img src="https://codeahoy.com/img/ps-pixel/5.png" alt="photoshop-pixel-art" /></p>
<p>That’s all there is. I hope you found this tutorial helpful and that you go on to create magnificent pixel art :) To <a href="http://www.bobrossquotes.com/quotes.shtml">quote</a> <a href="https://en.wikipedia.org/wiki/Bob_Ross">Bob Ross</a>:</p>
<blockquote>
<p>“<strong>People might look at you a bit funny, but it’s okay. Artists are allowed to be a bit different.</strong>”</p>
</blockquote>
Should You Unit Test Private Methods?2016-11-19T00:00:00+00:00https://codeahoy.com/2016/11/19/should-you-unit-test-private-methods<p>To unit test private methods or not to test, that’s the question. There are two kinds of software developers in this world: those who <em>never</em> subject private methods to unit testing directly and those who do. Let’s look at both sides of the arguments to understand this better.</p>
<!--more-->
<p>This <a href="http://stackoverflow.com/questions/105007/should-i-test-private-methods-or-only-public-ones?noredirect=1&lq=1">Stackoverflow question</a> highlights the divide. The accepted answer says:</p>
<blockquote>
<p><strong>I do not unit test private methods. A private method is an implementation detail that should be hidden to the users of the class</strong>. Testing private methods breaks encapsulation.</p>
<p>If I find that the private method is huge or complex or important enough to require its own tests, I just put it in another class and make it public there</p>
</blockquote>
<p>User Dave Sherohman <a href="http://stackoverflow.com/a/105209">disagrees</a>:</p>
<blockquote>
<p>[…] Personally, my primary use for code tests is to ensure that future code changes don’t cause problems and to aid my debugging efforts if they do. I find that testing the private methods just as thoroughly as the public interface (if not more so!) furthers that purpose.</p>
<p>Consider: You have public method A which calls private method B. A and B both make use of method C. C is changed (perhaps by you, perhaps by a vendor), causing A to start failing its tests. Wouldn’t it be useful to have tests for B also, even though it’s private, so that you know whether the problem is in A’s use of C, B’s use of C, or both?</p>
</blockquote>
<p>Another user <a href="http://stackoverflow.com/questions/34571/how-to-test-a-class-that-has-private-methods-fields-or-inner-classes?noredirect=1&lq=1#comment76873_34586">writes</a>:</p>
<blockquote>
<p>[…] It’s totally valid to have an algorithm in a private method which needs more unit testing than is practical through a class’s public interfaces.</p>
</blockquote>
<p>I like my encapsulation and stay away from unit testing private methods directly. Most of the time, the functionality provided by private methods is covered by the unit tests for public methods. If it isn’t, that’s a sign the private method should be given a class of its own. But sometimes, <strong>breaking out a new class isn’t easy or feasible, and you might end up introducing more complexity to the design</strong>. If that’s the case, it’s alright to go ahead and write unit tests for private methods.</p>
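<p>Here’s a sketch of the “put it in another class and make it public there” approach from the accepted answer. The names are hypothetical; the point is that the once-private logic gets a public, testable home while the original class keeps its encapsulation:</p>

```python
# Before: ReportGenerator had a private _format_row() method, complex enough
# to deserve its own tests but reachable only through generate().
# (Names are invented -- this shows the refactoring pattern, not real code.)

class RowFormatter:
    """The once-private formatting logic, promoted to its own public class."""
    def format(self, values):
        return " | ".join(str(v) for v in values)

class ReportGenerator:
    def __init__(self, formatter=None):
        self._formatter = formatter or RowFormatter()

    def generate(self, rows):
        return "\n".join(self._formatter.format(row) for row in rows)

# The extracted logic can now be unit tested directly...
assert RowFormatter().format([1, "a", 2.5]) == "1 | a | 2.5"
# ...while ReportGenerator is still tested through its public interface.
assert ReportGenerator().generate([[1, 2], [3, 4]]) == "1 | 2\n3 | 4"
```

<p>As a bonus, the extracted class can be swapped out in tests, which often simplifies the tests of the original class too.</p>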
Performance Testing Serverside Applications2016-11-16T00:00:00+00:00https://codeahoy.com/2016/11/16/performance-testing-serverside-applications<p>Performance testing server-side applications is a crucial process to help understand <em>how the application behaves under load</em>. It helps software teams fine-tune their applications to get the best performance while keeping the infrastructure costs low. Performance testing answers several important questions such as:</p>
<ul>
<li>Is the application ready to handle the traffic that’s going to hit it?</li>
<li>What do average <em>response times</em> and <em>latencies</em> look like under normal and peak loads?</li>
<li>Can the application be <em>scaled out</em>?</li>
<li>What are the bottlenecks? (could be CPU, memory, an external service or a database server)</li>
<li>How many instances are needed for supporting the estimated traffic (i.e. max RPS)?</li>
<li>What type of instances are needed? Does the application require an instance with a higher CPU-to-memory ratio? Or does it need an instance type that supports high network utilization?</li>
<li>Does the application slowly degrade in performance under load? Is it slowly leaking a resource that eventually crashes it after a few hours or days?</li>
</ul>
<!--more-->
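<p>Answering the response-time and latency questions above usually boils down to reducing thousands of recorded timings to a handful of percentiles. A minimal sketch using only the Python standard library (load-testing tools like JMeter report these for you, of course):</p>

```python
import statistics

def latency_summary(response_times_ms):
    """Reduce raw response times to the metrics a load test report needs."""
    # quantiles(n=100) returns the 99 cut points p1..p99 (Python 3.8+)
    q = statistics.quantiles(response_times_ms, n=100)
    return {
        "mean": statistics.fmean(response_times_ms),
        "p50": q[49],
        "p95": q[94],
        "p99": q[98],
        "max": max(response_times_ms),
    }

# 1000 samples: mostly fast responses with a slow outlier mixed in
times = [12, 15, 14, 13, 200, 16, 14, 13, 15, 14] * 100
summary = latency_summary(times)
print(summary["p50"], summary["p99"], summary["max"])
```

<p>Note how the mean hides the outlier while p99 exposes it — which is why load test reports lead with percentiles, not averages.</p>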
<p>A lesson I learned recently is that <strong>performance testing should not be an afterthought</strong>. Software teams should start performance testing early in the release cycle and not wait until the end to do it. I once worked on a team that built a backend service that passed all unit, integration and end-to-end tests with flying colors. QA engineers didn’t find any bugs in the application’s logic. However, the <em>performance was just terrible when we ran load tests on it</em>. On a single <a href="https://aws.amazon.com/ec2/instance-types/">m4.large</a> instance, the application supported 80% fewer requests than the team had estimated! The <em>main bottleneck</em> was the 2-core CPU, which was utilized to its maximum capacity as the application issued several queries to the database and applied complex algorithms to build a graph. Investigation revealed that reducing the amount of work the CPU was doing would require significant design changes. But it was already too late - the deadline was just weeks away. We decided to proceed with the release - albeit by over-provisioning the hardware and over-running our cost estimates by a factor of 3.</p>
<p>Performance testing is a broad topic. Teams I work with run <a href="https://en.wikipedia.org/wiki/Load_testing">load</a> and <a href="https://en.wikipedia.org/wiki/Soak_testing">soak</a> tests to measure performance metrics such as throughput, latency, resource utilization, etc. using a wide variety of tools. At <a href="http://www.glu.com">Glu</a>, we build REST services in Java and use the following tools for our performance tests:</p>
<ul>
<li><a href="https://yourkit.com/">YourKit</a> Profiler to profile CPU and memory usage at a fine-grained level.</li>
<li><a href="http://jmeter.apache.org/">Apache JMeter</a> to generate load (in practice, BlazeMeter or distributed JMeter).</li>
<li><a href="https://aws.amazon.com/cloudwatch/">Amazon Cloudwatch</a> to monitor resource utilization because we deploy our services on the Amazon cloud.</li>
<li><a href="https://www.hostedgraphite.com/">Hosted Graphite</a> to observe custom metrics that the application generates and we are interested in.</li>
<li><a href="https://www.elastic.co/products/kibana">Kibana</a> dashboards to look at the logs, errors, etc.</li>
</ul>
<p>Before I wrap this post up, there are a few other important lessons I’d like to share:</p>
<ul>
<li><strong>Run performance tests on a production-like environment</strong>. I have seen teams run performance tests on their MacBook Pros with 8-core CPUs and get <a href="https://forums.aws.amazon.com/thread.jspa?threadID=16912">drastically different results</a> than from the actual cloud instances with puny, virtualized hardware.</li>
<li>Create a good load test plan, which requires careful thought. <strong>The goal is to emulate the load that real users would generate</strong>; otherwise you might spend a lot of time chasing ghosts and fixing issues that are unlikely to happen in production. For example, I was investigating an issue with the performance of a Chat service under load. After looking at the load test script, I found that it grouped tens of thousands of users together. In reality, a group has on average about 10-50 people. The fix was to update the load test script to use a random group-id, or a group-id from a pool, instead of reusing the same id for each request.</li>
</ul>
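<p>The group-id fix from the last bullet is usually a one-line change in the load test script. A sketch of the idea in Python (the pool size and request shape here are invented for illustration):</p>

```python
import random

# A pool of group ids, approximating the spread of real production traffic.
# (Pool size is invented; pick something representative of your data.)
GROUP_POOL = [f"group-{i}" for i in range(5000)]

def build_request():
    """Build one simulated chat request against a randomly chosen group,
    instead of reusing a single shared group-id for every virtual user."""
    return {"path": "/chat/send", "group_id": random.choice(GROUP_POOL)}

# With a pool, 10,000 requests spread across many groups, as in production:
groups_hit = {build_request()["group_id"] for _ in range(10_000)}
print(len(groups_hit), "distinct groups hit")
```

<p>The same principle applies to any keyed resource in a load test — user ids, session tokens, cache keys — skewing them all onto one hot key exercises a code path production will rarely see.</p>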
<p>I hope this post was helpful. Would love to hear your thoughts, the tools and the approach you take for performance testing. Till next time.</p>
Taking Responsibility for Your Actions2016-11-14T00:00:00+00:00https://codeahoy.com/2016/11/14/taking-responsibility-for-your-actions<p>I read <a href="https://www.amazon.com/Pragmatic-Programmer-Journeyman-Master/dp/020161622X">Pragmatic Programmer</a> whenever I get a chance. It’s such a great book. I was flipping through the pages today and landed on the first chapter. After (re)reading it, I can say that it is arguably <em>the</em> best chapter in the book. All software developers must read it once and live by its philosophy:</p>
<blockquote>
<p>One of the cornerstones of the pragmatic philosophy is the idea of <strong>taking responsibility for yourself and your actions in terms of your career advancement, your project, and your day-to-day work</strong>. A Pragmatic Programmer takes charge of his or her own career, and <strong>isn’t afraid to admit ignorance or error</strong>. It’s not the most pleasant aspect of programming, to be sure, but it will happen—even on the best of projects. Despite thorough testing, good documentation, and solid automation, things go wrong. Deliveries are late. <strong>Unforeseen technical problems come up</strong>.</p>
</blockquote>
<p>In your career, you <strong>will make mistakes</strong>. <a href="http://codeahoy.com/2016/04/14/mistakes-at-work-are-not-sins/">They are inevitable</a>. I can’t count the number of times I have made mistakes. So when you do make a mistake for something you <em>accepted</em> responsibility for:</p>
<ul>
<li>Don’t deflect responsibility.</li>
<li>Don’t blame another team member.</li>
<li>Don’t blame a vendor.</li>
<li>Don’t blame a library or a tool that you use.</li>
<li>Don’t blame management.</li>
<li>Don’t become defensive.</li>
</ul>
<!--more-->
<p>Sure, any of the above factors could have played a role in the failure. But deflecting responsibility by making excuses or blaming someone or something is the worst possible way to handle such situations. What you should do instead is own up and offer solutions. That’s what your coworkers and management are interested in.</p>
<p>I’m reminded of this <a href="https://lkml.org/lkml/2012/12/23/75">email</a> from Linus in which he gets furious at a kernel programmer for making up “<em>lame excuses</em>”. (For the record, I strongly disagree with the language in the email.)</p>
<p>Responsibility is a two-way street. It has to be both given and accepted. I have seen managers foolishly thinking that they can walk up to an employee, say out loud “<em>You are responsible for this</em>” or “<em>You own this</em>”, and walk away. If the employee doesn’t accept, or accepts half-heartedly, the responsibility gets diluted. The employee must be willing to accept the responsibility. Quoting from Pragmatic Programmer:</p>
<blockquote>
<p><strong>Responsibility is something you actively agree to</strong>. You make a commitment to ensure that something is done right, but you don’t necessarily have direct control over every aspect of it. In addition to doing your own personal best, you must analyze the situation for risks that are beyond your control. <strong>You have the right not to take on a responsibility for an impossible situation</strong>, or one in which the risks are too great. You’ll have to make the call based on your own ethics and judgment.</p>
</blockquote>
<p>This doesn’t mean that you should never accept responsibility for anything. That’s not how you grow in an organization and it will not help your career. Be fair, professional and ethical - analyze the risks, determine additional needs such as staff, time or budget and negotiate with your boss or manager. You have every right to think about the situation, but be fair in your assessment.</p>
Git Tips - Undoing Accidental Commits2016-11-13T00:00:00+00:00https://codeahoy.com/2016/11/13/git-tips-undoing-accidental-commits<p>Here are a couple of git <em>undos</em> to get yourself out of trouble. Before you use anything from this post, please make a copy of your working directory and store it somewhere safe. <em>Git can be very unforgiving and you may lose your changes without any warning</em>.</p>
<!--more-->
<h3 id="accidentally-committed-to-master-instead-of-a-new-branch">Accidentally committed to <code class="language-plaintext highlighter-rouge">master</code> instead of a new branch</h3>
<p>Use the commands below if you have accidentally committed your changes to the master branch instead of a new branch but haven’t pushed them to the remote repository yet.</p>
<pre class="prettyprint lang-sh">
# Create a new branch copying current state of master
git branch new-branch
# Move master back one commit, dropping the accidental commit from master
git reset --hard HEAD^
# Switch to the new branch to see your changes.
git checkout new-branch
</pre>
<h3 id="accidentally-committed-to-the-wrong-branch">Accidentally committed to the wrong branch</h3>
<p><code class="language-plaintext highlighter-rouge">git cherry-pick &lt;commit-hash&gt;</code> is a handy command to choose a commit from one branch and apply it to another.</p>
<pre class="prettyprint lang-sh">
# Note the commit-hash you want to move
git log
# Remove the last commit from the wrong branch
git reset --hard HEAD^
# Switch to the right branch
git checkout right-branch
# Apply the commit to the right branch
git cherry-pick commit-hash
</pre>
<p>Till next time.</p>
Automated Tests Help Developers Sleep Better2016-11-12T00:00:00+00:00https://codeahoy.com/2016/11/12/automated-tests-help-developers-sleep-better<p>In <a href="https://www.amazon.com/Pragmatic-Programmer-Journeyman-Master/dp/020161622X">Pragmatic Programmer</a>, Andy and Dave wrote:</p>
<blockquote>
<p>Most developers hate testing. They tend to test gently, subconsciously knowing where the
code will break and avoiding the weak spots. Pragmatic Programmers are different. <strong>We are
driven to find our bugs now, so we don’t have to endure the shame of others finding our
bugs later</strong>.</p>
</blockquote>
<p>There hasn’t been any shortage of literature on the benefits of automated testing in the last 10-15 years. Yet it comes as a surprise to me that most developers still don’t like to write unit tests. Some managers <a href="http://codeahoy.com/2016/04/16/do-not-misuse-code-coverage/">force developers to write tests by making an arbitrary code coverage number mandatory</a> for releases. This is dangerous because the result is often low-quality, ineffective tests that do not catch the bugs they were supposed to.</p>
<!--more-->
<p>But why do some developers dislike testing? One possible explanation is that they don’t buy the idea. Software developers are very smart people, and they cannot be coerced into accepting an idea just because management thinks it’s great. In their defense, management often fails to coach or convey the idea that <strong>automated tests benefit developers the most</strong>: they are written <em>by</em> the developers, <em>for</em> the developers themselves. They help developers sleep better at night.</p>
<p>Automated tests are very good at catching bugs before the code is released - <em>not all the bugs, but most of them</em>. <strong>Software developers, whether on-call or not, are on the hook for any bugs in their code. When there’s a bug that affects millions of users, it’s the developers who have to get out of their beds at 2am or stay late evenings to provide a fix</strong>. Not to mention the embarrassment of introducing a bug that irritated thousands or millions of users or resulted in lost revenue. Often the root cause takes a long time to find simply because, at the time, a developer failed to take a break from implementing the functionality to think about all the ways it could break and write good tests to ensure the bug was caught before the production release.</p>
<p>Developers must be driven to test the <em>sh</em><em>*t</em> out of their code. We must put a comparable (sometimes more) effort into writing tests as we do on the actual feature itself. <a href="http://codeahoy.com/2016/07/05/unit-integration-and-end-to-end-tests-finding-the-right-balance/">Different testing techniques</a> should be used to catch different types of bugs.</p>
<blockquote>
<p>Finding bugs is somewhat like fishing with a net. We use fine, small nets (unit tests) to
catch the minnows, and big, coarse nets (integration tests) to catch the killer sharks.</p>
</blockquote>
<p>Pragmatic programmers don’t just stop with unit, integration or end-to-end tests. They also load test with 10x the peak traffic to catch the killer whales (even this <a href="http://www.shacknews.com/article/96998/pokemon-go-had-10x-more-users-at-launch-than-the-expected-worst-case-scenario">wasn’t enough for Niantic</a>). They test their system by replaying and replicating actual production traffic patterns. They perform <a href="https://en.wikipedia.org/wiki/Monkey_testing">monkey testing</a> to see if their system crashes. <strong>They work very hard to break their own code, motivated by the desire to minimize the number of times they’ll have to be rocketed out of sleep or spend their weekends finding and fixing bugs in their code</strong>.</p>
Building Microservices in Python and Flask (GitHub Project Included)2016-07-10T00:00:00+00:00https://codeahoy.com/2016/07/10/writing-microservices-in-python-using-flask<p>After years of building applications and platforms using the <a href="https://en.wikipedia.org/wiki/Service-oriented_architecture">Service Oriented Architecture</a>, I became very interested in <a href="http://martinfowler.com/articles/microservices.html">microservices</a> last year. So much so that I accepted a job offer based solely on the fact that it gave me an opportunity to design and develop microservices on the AWS platform. I’ll share the pros and cons of microservices in a later post.</p>
<p>In this post, we’ll see how to build microservices in Python using a <em>lightweight</em> framework called <a href="http://flask.pocoo.org/">Flask</a>. Unlike other web frameworks (e.g. Rails), Flask is very flexible and doesn’t force you to adopt a specific layout style for your projects. It’s lightweight because it doesn’t require users to use particular tools or libraries. For example, Flask doesn’t come with any database access libraries. Instead, you use extensions to add the functionality you want.</p>
<!--more-->
<h2 id="cinema-3">Cinema 3</h2>
<p>I have created a fictional project called <a href="https://github.com/umermansoor/microservices">Cinema 3</a> that demonstrates the use of microservices using Python and Flask. In this hypothetical project, we have a few microservices working together to allow users to find movies and book tickets online. The microservices which make up the project are:</p>
<ul>
<li><a href="https://github.com/umermansoor/microservices/blob/master/services/movies.py">Movies</a>: Manages information related to movies e.g. title, rating, etc.</li>
<li><a href="https://github.com/umermansoor/microservices/blob/master/services/showtimes.py">Show Times</a>: Provides show times for movies.</li>
<li><a href="https://github.com/umermansoor/microservices/blob/master/services/bookings.py">Bookings</a>: Handles online booking.</li>
<li><a href="https://github.com/umermansoor/microservices/blob/master/services/user.py">Users</a>: Manages user accounts for our project.</li>
</ul>
<p>The microservices talk to each other using a REST API. How do they know the address (host:port) of other services? In this example, each microservice runs on a separate port so we can identify them. In real-world projects, people use more advanced techniques such as service discovery, either server-side or client-side; <a href="https://www.consul.io/">Consul</a> is a popular tool for this purpose.</p>
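<p>To make this concrete, here is a minimal sketch of what a movies-style service might look like in Flask. The route shape and sample data below are hypothetical simplifications for illustration, not the project’s actual code:</p>

```python
# A minimal, self-contained Flask microservice in the spirit of the
# Movies service. The in-memory dict stands in for a real database.
from flask import Flask, jsonify

app = Flask(__name__)

MOVIES = {  # hypothetical sample data
    "267eedb8": {"title": "The Good Dinosaur", "rating": 7.4},
}

@app.route("/movies/<movie_id>", methods=["GET"])
def movie_info(movie_id):
    # Return the movie record as JSON, or a 404 if it doesn't exist.
    if movie_id not in MOVIES:
        return jsonify({"status": "movie not found"}), 404
    return jsonify(MOVIES[movie_id])

# Each service would run on its own, well-known port, e.g.:
# app.run(port=5001)
```

<p>With the service running on its own port, the bookings service would reach it at <code>http://127.0.0.1:5001/movies/&lt;movie_id&gt;</code>, which is exactly the hard-coded addressing scheme that service discovery later replaces.</p>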
<h3 id="flask-vs-other-python-web-application-frameworks">Flask vs. Other Python Web Application Frameworks</h3>
<p>Before you decide to use Flask, see how it holds up against other Python web application frameworks on popularity, features and more. Click on the image below to start comparing.</p>
<p><a href="/compare/python-frameworks.html">
<img src="/assets/images/compare/python-frameworks/compare-home-header.jpg" alt="Compare Top Python Frameworks" />
</a></p>
<h2 id="sample-code-on-github">Sample Code on GitHub</h2>
<p><strong>Here’s a <a href="https://github.com/umermansoor/microservices">link to the project on GitHub</a></strong>. The source code itself is pretty simple as this is just an example to give you a basic understanding of building microservices using Flask.</p>
<p>If you have any comments or question about the project, please let me know in the comments section below.</p>
<p>If you’re looking to learn Flask, here’s an excellent YouTube video series you should watch.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/MwZwr5Tvyxo" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
Should the US Allow Foreign Developers?2016-07-09T00:00:00+00:00https://codeahoy.com/2016/07/09/should-the-us-allow-foreign-developers<p>Hiring good developers is really difficult. It’s even more difficult when the market is red hot. The Bay Area is ripe with opportunities for job-seekers and good developers are almost impossible to come by. Today it’s easier than ever for developers to choose the job they really like, at least in the US. While it’s great for the developers, <strong>companies and especially start-ups are hurting because there just aren’t enough talented developers to go around</strong>. Start-ups and technology companies want the US government to relax its immigration policies so they can bring more developers from other countries. Anti-immigration groups are against letting “foreigners” in. They argue that the focus should be on training more Americans to program. While they are certainly not wrong about teaching more people to program, <em>Paul Graham</em> <a href="http://www.paulgraham.com/95.html">argues that the technology companies are right</a> and that training alone cannot match the talent available elsewhere in the world:</p>
<!--more-->
<blockquote>
<p>The technology companies are right. What the anti-immigration people don’t understand is that <strong>there is a huge variation in ability between competent programmers and exceptional ones, and while you can train people to be competent, you can’t train them to be exceptional</strong>. Exceptional programmers have an aptitude for and interest in programming that is not merely the product of training.</p>
<p>The US has less than 5% of the world’s population. Which means if the qualities that make someone a great programmer are evenly distributed, 95% of great programmers are born outside the US.</p>
</blockquote>
<p>He’s right. If people don’t have an interest in programming, they give up or even worse, stick around just for the money. I have seen too many “developers” (both American-born and foreign) doing it just for the money. <strong>They religiously punch the clock 9 to 5 for one thing: the paycheck</strong>. They have absolutely zero passion for their profession and no respect for software engineering. To be fair to them, staring at lines of code and thinking in terms of algorithms isn’t for everyone. Good developers who actually love software but are working in environments where their passion isn’t shared by their colleagues either quit or mentally check out.</p>
<p>And what about the claim that technology companies like foreigners because they can pay them lower wages? Or that by having more options available, they can drive wages down? I personally haven’t seen the wage difference. Foreigners are paid about the same as locals, and bringing them over to the US isn’t cheap either: there are legal fees, relocation bonuses and a huge risk that the person may not like the job and quit within a year.</p>
<blockquote>
<p>So they [Anti-immigration groups] claim it’s because they want to drive down salaries. But if you talk to startups, you find practically every one over a certain size has gone through legal contortions to get programmers into the US, where they then paid them the same as they’d have paid an American. <strong>Why would they go to extra trouble to get programmers for the same price? The only explanation is that they’re telling the truth: there are just not enough great programmers to go around</strong>.</p>
</blockquote>
<p>I know at least three start-ups that shipped their entire software development to India (that’s 30 software development jobs between them) just because they couldn’t find developers locally.</p>
<p>So should the US allow a few thousand great programmers to come in every year? I think the answer is that it <em>absolutely</em> should. Growing up, my dad used to tell me that one of the reasons why the US is the greatest country in the world is because it attracts bright people from all over the planet and provides them plenty of opportunities to succeed. China may have the potential to <a href="http://www.infoworld.com/article/2862396/techology-business/chinas-competition-with-silicon-valley-is-about-to-get-serious.html">challenge</a> the Silicon Valley. Paul Graham asks what would happen if great programmers gathered someplace else:</p>
<blockquote>
<p>What if most of the great programmers collected in one hub, and it wasn’t here [in the US]? That scenario may seem unlikely now, but it won’t be if things change as much in the next 50 years as they did in the last 50.</p>
<p><strong>We have the potential to ensure that the US remains a technology superpower just by letting in a few thousand great programmers a year</strong>. What a colossal mistake it would be to let that opportunity slip. <strong>It could easily be the defining mistake this generation of American politicians later become famous for</strong>. And unlike other potential mistakes on that scale, it costs nothing to fix.</p>
</blockquote>
<p>Do you agree or disagree? Please leave your comments in the section below.</p>
Interactive Emails with Email Markup2016-07-08T00:00:00+00:00https://codeahoy.com/2016/07/08/interactive-emails-with-email-markup<p>It was a pleasant surprise when the itinerary for my upcoming flight automatically showed up on my <a href="https://en.wikipedia.org/wiki/Samsung_Galaxy_S6">Galaxy S6</a>. I didn’t have to open my Gmail and search for the confirmation email. It was right there, just when I needed it to check the departure time.</p>
<p><img src="https://codeahoy.com/img/blogs/google-flight-trip.png" alt="google-flight-trip" />
<em>(disclaimer: not my itinerary)</em></p>
<!--more-->
<p>I immediately started wondering how it did that. At first, I thought Google did this by parsing and analyzing the text in the confirmation email and making a best-effort extraction of the flight information. But that wouldn’t be very reliable and luckily that’s not how it works. In reality, the confirmation email that was sent to me <strong>contained additional information, as metadata, which Google used to display the itinerary</strong>. Google calls it the <a href="https://developers.google.com/gmail/markup/overview">Email Markup</a>. In essence, senders create and put <a href="http://schema.org/">markup</a> in their emails to instruct Google to display an interactive <a href="https://developers.google.com/gmail/markup/actions/actions-overview">action</a> to the recipients, such as RSVP an event, confirm subscription, fetch boarding passes, etc.</p>
<blockquote>
<p>Email is an important part of how we get things done – from planning an event with friends to organizing a trip to Paris. So much information is contained inside emails – like the details of a dinner party or travel itinerary – <strong>and so many emails require action – like RSVP, or flight checkin</strong>.</p>
<p>By <strong>adding <a href="http://schema.org/">schema.org</a> markup to the emails you send your users, you can make that information available across their Google experience, and make it easy for users to take quick action</strong>. Inbox, Gmail, Google Calendar, Google Search, and Google Now all already use this structured data.</p>
</blockquote>
<p>Some actions show up as buttons next to the subject line in the inbox. While this feature has been around for a couple of years at least, I believe Google has only recently started using this information across other services like Google Now and Google Calendar.</p>
<p><img src="https://codeahoy.com/img/blogs/gmail-action-buttons.png" alt="gmail-action-buttons" /></p>
<p>I personally find this very useful and it has some interesting <a href="https://moz.com/blog/markup-for-emails">use cases</a>:</p>
<blockquote>
<ul>
<li>
<p><a href="https://developers.google.com/gmail/markup/reference/flight-reservation">Flight reservations</a> - Includes options for displaying basic flight confirmation information, boarding pass, check-in, update a flight, cancel a flight, and additional options. This Highlight is also supported in Google Now.</p>
</li>
<li>
<p><a href="https://developers.google.com/gmail/markup/reference/order">Orders</a> - Includes options for displaying basic order information, view order action, and order with billing details.</p>
</li>
<li>
<p><a href="https://developers.google.com/gmail/markup/reference/parcel-delivery">Parcel deliveries</a> - Includes options for displaying basic parcel delivery information and detailed shipping information.</p>
</li>
<li>
<p><a href="https://developers.google.com/gmail/markup/reference/hotel-reservation">Hotel reservations</a> - Includes options for displaying basic hotel reservation information, updating a reservation, and canceling a reservation. This Highlight is also supported in Google Now.</p>
</li>
<li>
<p><a href="https://developers.google.com/gmail/markup/reference/restaurant-reservation">Restaurant reservations</a> - Includes options for displaying basic restaurant reservation information, updating a reservation, and canceling a reservation. This Highlight is also supported in Google Now.</p>
</li>
<li>
<p><a href="https://developers.google.com/gmail/markup/reference/event-reservation">Event reservation</a> - Includes options for basic event reminders without a ticket, event with ticket & no reserved seating, sports or music event with ticket, event with ticket & reserved seating, multiple tickets, updating an event, and canceling an event. This Highlight is also supported in Google Now.</p>
</li>
</ul>
</blockquote>
<p>For more information:</p>
<ul>
<li>Kristi Hines has written a good <a href="https://moz.com/blog/markup-for-emails">article</a> if you are interested in using Email Markup for your products or services.</li>
<li><a href="https://developers.google.com/gmail/markup/getting-started">Guide</a> from Google.</li>
</ul>
Unit, Integration and End-To-End Tests - Finding the Right Balance2016-07-05T00:00:00+00:00https://codeahoy.com/2016/07/05/unit-integration-and-end-to-end-tests-finding-the-right-balance<p>This is something I have regrettably noticed in many backend projects that I have worked on. Developers write “unit tests” that in reality are ‘end-to-end’ tests. They test the entire flow of the application from start to end. There is no isolation of units and <strong>the notion of the unit is the whole system</strong>, along with all of its external dependencies like databases, queues, caches, and other services. For a web server project, these tests start the server, initialize an HTTP client, make an HTTP request and check the response to make sure it has all the expected information. If so, the test is declared a success. By treating the whole system as a unit and not testing independent units in isolation and their interplay, we lose many benefits that unit and integration tests offer.</p>
<!--more-->
<p>Technically speaking, these developers aren’t violating the definition or principles of unit testing. Unit testing is <a href="http://martinfowler.com/bliki/UnitTest.html">ill-defined</a>. I don’t claim to be an expert, but in my humble opinion:</p>
<ul>
<li>Unit testing should focus on testing small units (typically a Class or a complex algorithm).</li>
<li>Units should be tested in isolation and independent of other units. This is typically achieved by <a href="https://en.wikipedia.org/wiki/Mock_object">mocking</a> the dependencies.</li>
<li>Unit tests should be fast. They usually shouldn’t take more than a few seconds to provide feedback.</li>
</ul>
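<p>As a concrete sketch of these principles, here is a hypothetical unit test (all names invented for illustration) that exercises a small booking component in isolation by mocking the movies client it depends on. No network, no database, and feedback in milliseconds:</p>

```python
import unittest
from unittest.mock import Mock

class BookingService:
    """Hypothetical booking logic that depends on a movies client."""
    def __init__(self, movies_client):
        self.movies_client = movies_client

    def book(self, user, movie_id):
        # Reject bookings for movies the movies service doesn't know about.
        if self.movies_client.get_movie(movie_id) is None:
            raise ValueError("unknown movie")
        return {"user": user, "movie_id": movie_id}

class BookingServiceTest(unittest.TestCase):
    def test_books_known_movie(self):
        # The movies dependency is mocked out: the unit is tested alone.
        movies = Mock()
        movies.get_movie.return_value = {"title": "The Good Dinosaur"}
        booking = BookingService(movies).book("chris", "m1")
        self.assertEqual(booking["movie_id"], "m1")
        movies.get_movie.assert_called_once_with("m1")

    def test_rejects_unknown_movie(self):
        movies = Mock()
        movies.get_movie.return_value = None
        with self.assertRaises(ValueError):
            BookingService(movies).book("chris", "m999")
```

<p>If a test here fails, it points directly at the booking logic, not at some unreachable database or misbehaving downstream service.</p>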
<p>Most projects benefit from having a <strong>balanced mix of various automated tests to capture different types of errors</strong>. The exact composition of the mix varies depending on the nature of the project, as we’ll see later.</p>
<p>End-to-end tests are good at capturing certain kinds of bugs, but their <strong>biggest drawback is that they cannot pin-point the root cause of failure</strong>. Anything in the entire flow could have contributed to the error. In large and complex systems, it’s like finding a needle in the haystack: you’ll find the root cause, but it will take time. Because unit tests focus on small modules that are tested independently, they can <strong>identify the lines of code that caused the failure</strong> with laser-sharp accuracy, which can save a lot of time.</p>
<p>Another nice thing about unit tests is that <strong>they always work, and they work fast</strong>. Unlike end-to-end tests that rely on external components, <strong>unit tests are not flaky</strong>. If I can build a project on my machine, I should be able to run its unit tests. In contrast, end-to-end tests would fail if some external component, like a database or a messaging queue, is not available or cannot be reached. And they can take a very long time to run.</p>
<p>Unit tests allow developers to <strong>refactor and add new features with confidence</strong>. When I’m refactoring a complex project that has well-written unit tests, I run them often, usually after every small change. In a matter of a few seconds, I know whether I broke something or not. Even better, a failing test usually prints a nice message telling me what broke: knowing whether some <a href="http://xunitpatterns.com/Guard%20Assertion.html">GuardAssertion</a> failed or the expected response was off by one helps me isolate the failure.</p>
<p>Between unit and end-to-end tests lie integration tests. They have one major advantage over unit tests: <strong>they ensure that modules which work well in isolation, also play well together</strong>. Integration tests typically focus on a small number of modules and test their interactions.</p>
<p>The key is to find the <strong>right balance between unit, integration and end-to-end tests</strong>. According to <a href="http://googletesting.blogspot.co.uk/2015/04/just-say-no-to-more-end-to-end-tests.html">Google’s Testing Blog</a>:</p>
<blockquote>
<p>To find the right balance between all three test types, the best visual aid to use is the testing pyramid. Here is a simplified version of the testing pyramid […]:</p>
<p><img src="https://codeahoy.com/img/blogs/test_pyramid.png" alt="test_pyramid" /></p>
<p>The bulk of your tests are unit tests at the bottom of the pyramid. As you move up the pyramid, your tests get larger, but at the same time the number of tests (the width of your pyramid) gets smaller.</p>
<p>As a good first guess, <strong>Google often suggests a 70/20/10 split: 70% unit tests, 20% integration tests, and 10% end-to-end tests</strong>. The exact mix will be different for each team, but in general, <em>it should retain that pyramid shape</em>. <strong>Try to avoid these anti-patterns</strong>:</p>
<ul>
<li>
<p><strong>Inverted pyramid/ice cream cone</strong>. The team relies primarily on end-to-end tests, using few integration tests and even fewer unit tests.</p>
</li>
<li>
<p><strong>Hourglass</strong>. The team starts with a lot of unit tests, then uses end-to-end tests where integration tests should be used. The hourglass has many unit tests at the bottom and many end-to-end tests at the top, but few integration tests in the middle.</p>
</li>
</ul>
</blockquote>
<p>A <em>70/20/10</em> split between unit, integration and end-to-end tests is a good, general rule of thumb. If a project has a large number of integrations or complex interfaces, it should have more integration and end-to-end tests. A project that is primarily focused on computation or data should have more unit tests and fewer integration tests. The right mix depends on the nature of the project, but the <strong>key is to retain the pyramid shape</strong> of the testing pyramid, that is, <code class="language-plaintext highlighter-rouge">Unit > Integration > End-to-End Tests</code>.</p>
RESTful - What Are Idempotent and Safe Methods and How to Use Them?2016-07-04T00:00:00+00:00https://codeahoy.com/2016/07/04/rest-design---choosing-the-right-http-method<p>One of the challenges when designing a REST API is choosing the right HTTP method (<em>GET</em>, <em>PUT</em>, <em>POST</em> etc.) that corresponds with the operation being performed. Some people incorrectly assume that they can freely choose any method as long as the client and the server agree on it. This is wrong because a request passes through many intermediaries and middleware applications which perform optimizations based on the HTTP method type. These optimizations depend on two key characteristics of HTTP methods: <a href="http://codeahoy.com/2016/06/30/idempotent-and-safe-http-methods-why-do-they-matter/">idempotency and safety</a>, which are defined in the <a href="https://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html">HTTP specification</a>.</p>
<!--more-->
<p><strong>Safe HTTP Methods</strong>: Safe methods aren’t expected to cause any side effects. These operations are read-only. E.g. querying a database.</p>
<p><strong>Idempotent HTTP Methods</strong>: Idempotent methods guarantee that repeating a request has the same effect as making the request once.</p>
<p>Idempotency and safety are properties of HTTP methods that server applications must correctly implement. This means if you are implementing an operation and choose an idempotent HTTP method to invoke the operation, you must ensure that the implementation returns the same result if invoked once or multiple times for the same input.</p>
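<p>A small Python sketch (a hypothetical in-memory store, not a real server) illustrates this contract: a POST-style create changes server state on every call, while a PUT-style full replace converges to the same state no matter how many times it is repeated:</p>

```python
import uuid

students = {}  # hypothetical in-memory store standing in for a database

def post_student(body):
    """POST /students: creates a new resource every time -- NOT idempotent.
    Repeating the same request creates a second student."""
    student_id = str(uuid.uuid4())
    students[student_id] = dict(body)
    return student_id

def put_student(student_id, body):
    """PUT /students/<id>: replaces the whole resource -- idempotent.
    Repeating the same request leaves the store in the same state."""
    students[student_id] = dict(body)

post_student({"name": "Michael Scarn"})
post_student({"name": "Michael Scarn"})  # repeat: two students now exist

put_student("s1", {"name": "Michael Scarn", "gpa": "3.9"})
put_student("s1", {"name": "Michael Scarn", "gpa": "3.9"})  # repeat: no change
```

<p>This is why a client or proxy may safely retry a PUT that timed out, but must never blindly retry a POST.</p>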
<h3 id="get-idempotent--safe">GET: Idempotent & Safe</h3>
<p>GET requests are used for retrieving information. These requests must be idempotent and <strong>safe</strong>: any operation invoked using GET must not alter the state of any resource.</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">GET</span> <span class="o">/</span><span class="nx">books</span>
</code></pre></div></div>
<p>To get a specific book,</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">GET</span> <span class="o">/</span><span class="nx">books</span><span class="o">/<</span><span class="nx">title</span><span class="o">></span>
</code></pre></div></div>
<h3 id="post-non-idempotent">POST: Non-Idempotent</h3>
<p>POST requests are not idempotent. They are used for creating new resources or updating existing ones. For example, suppose we have a resource called <code class="language-plaintext highlighter-rouge">Student</code> with the following attributes: <code class="language-plaintext highlighter-rouge">name</code>, <code class="language-plaintext highlighter-rouge">college</code>, <code class="language-plaintext highlighter-rouge">major</code> and <code class="language-plaintext highlighter-rouge">gpa</code>. To enroll a new student, we can use POST to create a new resource:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">POST</span> <span class="o">/</span><span class="nx">students</span><span class="o">/</span> <span class="c1">// Create a new student</span>
<span class="p">{</span>
<span class="dl">"</span><span class="s2">name</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">Michael Scarn</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">college</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">Stanford</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">major</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">computer science</span><span class="dl">"</span>
<span class="p">}</span>
</code></pre></div></div>
<p><strong>POST requests are allowed to perform partial updates</strong>. For example, to update the GPA of a student, we’ll make the POST request on the specific record (given by the <code class="language-plaintext highlighter-rouge">student_id</code>) and only supply the attribute we want to update:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">POST</span> <span class="o">/</span><span class="nx">students</span><span class="o">/<</span><span class="nx">student_id</span><span class="o">></span>
<span class="p">{</span>
<span class="dl">"</span><span class="s2">gpa</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">3.9</span><span class="dl">"</span>
<span class="p">}</span>
</code></pre></div></div>
<p>In the above example, if the <code class="language-plaintext highlighter-rouge">student_id</code> doesn’t exist, the application should return <code class="language-plaintext highlighter-rouge">HTTP 404: Not Found</code> error.</p>
<h3 id="put-idempotent">PUT: Idempotent</h3>
<p>PUT requests are idempotent. This means that identical requests can be repeated multiple times, and the state of the resource on the server doesn’t change any further after the first request. PUT can be used to create a new resource or update an existing one. For updates, <strong>PUT requests must contain all the attributes of the resource, unlike POST requests which can have partial attributes</strong>. Here’s a PUT request to update a student’s GPA:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">PUT</span> <span class="o">/</span><span class="nx">students</span><span class="o">/<</span><span class="nx">student_id</span><span class="o">></span>
<span class="p">{</span>
<span class="dl">"</span><span class="s2">name</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">Michael Scarn</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">college</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">Stanford</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">major</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">computer science</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">gpa</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">3.9</span><span class="dl">"</span>
<span class="p">}</span>
</code></pre></div></div>
<p>In the above example, if the <code class="language-plaintext highlighter-rouge">student_id</code> in the PUT request doesn’t exist, the application may create a new record and assign it the <code class="language-plaintext highlighter-rouge">student_id</code>.</p>
<p>If you are wondering why PUT requests must have all the attributes and not just the one we want to update, consider the following hypothetical situation in which we allow partial updates in the PUT request. Suppose that a student is switching majors. The client issues two PUT requests: the first one to update student’s GPA for the existing major and a second request to update the major and reset the GPA. The PUT requests are made in the right order:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">PUT</span> <span class="o">/</span><span class="nx">students</span><span class="o">/<</span><span class="nx">student_id</span><span class="o">></span> <span class="c1">//Partial update: Violates idempotency contract</span>
<span class="p">{</span>
<span class="dl">"</span><span class="s2">gpa</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">3.85</span><span class="dl">"</span>
<span class="p">}</span>
</code></pre></div></div>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">PUT</span> <span class="o">/</span><span class="nx">students</span><span class="o">/<</span><span class="nx">student_id</span><span class="o">></span> <span class="c1">//Partial update: Violates idempotency contract</span>
<span class="p">{</span>
<span class="dl">"</span><span class="s2">major</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">engineering</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">gpa</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">0.0</span><span class="dl">"</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Assume that the second request arrives first, and the first request is lost due to a network error. Because the client believes that PUT requests are idempotent, it may retry the first request. The retry will incorrectly update the student’s GPA to 3.85 for the new major, leaving the resource in an <em>inconsistent state</em>. This kind of bug can be very hard to trace, and it breaks the idempotency contract: multiple invocations result in different states.</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">PUT</span> <span class="o">/</span><span class="nx">students</span><span class="o">/<</span><span class="nx">student_id</span><span class="o">></span>
<span class="p">{</span>
<span class="dl">"</span><span class="s2">name</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">Michael Scarn</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">college</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">Stanford</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">major</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">engineering</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">gpa</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">3.85</span><span class="dl">"</span> <span class="c1">// GPA updated for the wrong major!</span>
<span class="p">}</span>
</code></pre></div></div>
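<p>The failure mode is easy to simulate. In this Python sketch (hypothetical data), a merge-style partial “PUT” applied out of order produces exactly the inconsistent state shown above:</p>

```python
# Simulation: partial "PUT"s applied out of order leave the resource
# pairing the old major's GPA with the new major.
student = {"major": "computer science", "gpa": "3.5"}

def partial_put(resource, body):
    """A PUT that merges partial bodies -- violates the idempotency contract."""
    resource.update(body)

update_gpa = {"gpa": "3.85"}                           # request 1 (lost, then retried)
switch_major = {"major": "engineering", "gpa": "0.0"}  # request 2

partial_put(student, switch_major)  # request 2 arrives first
partial_put(student, update_gpa)    # retried request 1 arrives last
# student: {"major": "engineering", "gpa": "3.85"} -- wrong major's GPA
```

<p>Had each PUT carried the full resource, the retry would have replaced the record wholesale and the final state would match whichever complete request arrived last.</p>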
<p><strong>Partial updates are especially important if the resource contains a lot of attributes. It is wasteful to first fetch the resource using GET and then send all the attributes when the goal is to update just a few</strong>. In reality, a lot of developers allow partial updates with PUT, which is a violation of the idempotency contract. For partial updates, use POST, or as we’ll later see, PATCH.</p>
<h4 id="post-vs-put---partial-vs-complete-updates-to-resource">POST vs PUT - Partial vs Complete Updates to Resource</h4>
<p>There is a common <a href="http://www.tbray.org/ongoing/When/200x/2009/03/20/Rest-Casuistry">misconception</a> that it isn’t RESTful to use the HTTP POST method to update existing resources. People who hold this view suggest that one should always use the PUT method for updates and POST for creating new resources. This isn’t quite right. The REST standard doesn’t stop us from using POST requests for updates. In fact, it doesn’t even talk about it because <strong>idempotency and safety guarantees are properties of the HTTP protocol, not of the REST standard</strong>. <a href="http://roy.gbiv.com/">Roy Fielding</a>, <a href="http://roy.gbiv.com/untangled/2009/it-is-okay-to-use-post">writes</a>:</p>
<blockquote>
<p>Some people think that REST suggests not to use POST for updates. Search my dissertation and you won’t find any mention of CRUD or POST. The only mention of PUT is in regard to HTTP’s lack of write-back caching. The main reason for my lack of specificity is because the methods defined by HTTP are part of the Web’s architecture definition, not the REST architectural style. […] For example, it isn’t RESTful to use GET to perform unsafe operations because that would violate the definition of the GET method in HTTP, which would in turn mislead intermediaries and spiders.</p>
</blockquote>
<p>Therefore, <strong>the choice between PUT and POST boils down to one thing: the idempotency guarantee of these methods</strong>. Because PUT is idempotent, clients or intermediaries can repeat a PUT request if the response for the first request doesn’t arrive on time, even though the request may have been processed by the server. In order to stay idempotent, PUT requests must replace the entire resource and hence must send all the attributes. For partial updates, POST or PATCH (non-idempotent methods) must be used.</p>
<h3 id="delete-idempotent">DELETE: Idempotent</h3>
<p>DELETE requests are idempotent and are used for deleting a resource. One common confusion people have with DELETE calls is what type of HTTP status code to return on repeat calls. Some people assume that because DELETE is idempotent, it must always return the same HTTP status e.g. <code class="language-plaintext highlighter-rouge">HTTP 200</code>. This assumption is wrong. Although there is no harm in returning the same status code, if your use case requires it, <strong>returning a different status code like the <code class="language-plaintext highlighter-rouge">HTTP 404: Not Found</code> doesn’t violate the idempotency contract</strong>. Idempotency doesn’t concern itself with what is returned to the client. It refers to the state of some resource on the server. So it is perfectly valid to return <code class="language-plaintext highlighter-rouge">HTTP 200: OK</code> on the first delete call, and <code class="language-plaintext highlighter-rouge">HTTP 404: Not Found</code> on subsequent ones, since in both cases, the resource is deleted and its state isn’t changed on the server side. You might also use <code class="language-plaintext highlighter-rouge">HTTP 204: No Content</code> if the response body is empty.</p>
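<p>A short sketch (hypothetical store) shows why differing status codes don’t break idempotency: the server-side state is identical after one call or many, since the resource is gone either way:</p>

```python
# DELETE stays idempotent even though repeat calls return a different
# status code -- idempotency is about server state, not the response.
students = {"s1": {"name": "Michael Scarn"}}

def delete_student(student_id):
    if student_id in students:
        del students[student_id]
        return 200  # first call: resource deleted
    return 404      # repeat calls: already gone, state unchanged
```

<p>Calling <code>delete_student("s1")</code> once or five times leaves the store in the same state; only the status code reported to the client differs.</p>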
<h3 id="patch-non-idempotent">PATCH: Non-Idempotent</h3>
<p>I like to think of PATCH as the non-idempotent cousin of the PUT request. Because it is non-idempotent, it can be used for partial updates:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">PATCH</span> <span class="o">/</span><span class="nx">students</span><span class="o">/<</span><span class="nx">student_id</span><span class="o">></span> <span class="c1">//Partial update: OK</span>
<span class="p">{</span>
<span class="dl">"</span><span class="s2">gpa</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">3.85</span><span class="dl">"</span>
<span class="p">}</span>
</code></pre></div></div>
<p>It is up to you whether you want to adopt PATCH for partial updates or use POST.</p>
<p>Here’s a table summarizing the results:</p>
<table>
<thead>
<tr>
<th>HTTP Method</th>
<th>Idempotent?</th>
<th>Operation</th>
</tr>
</thead>
<tbody>
<tr>
<td>GET</td>
<td>yes</td>
<td>Retrieval or query.</td>
</tr>
<tr>
<td>POST</td>
<td>NO</td>
<td>Create or update resources. Partial updates are allowed.</td>
</tr>
<tr>
<td>PUT</td>
<td>yes</td>
<td>Create or update resources. Partial updates are not allowed.</td>
</tr>
<tr>
<td>DELETE</td>
<td>yes</td>
<td>Delete resources.</td>
</tr>
<tr>
<td>PATCH</td>
<td>NO</td>
<td>For partial updates.</td>
</tr>
</tbody>
</table>
<h1><a href="https://codeahoy.com/2016/06/30/idempotent-and-safe-http-methods-why-do-they-matter">Idempotent and safe HTTP methods - REST API</a></h1>
<p><em>2016-06-30</em></p>
<p>If you are designing or building <strong>REST API</strong>s, you should be aware of two very important properties of HTTP methods: <strong>idempotency and safety</strong>. These properties are defined in the <a href="https://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html">HTTP specification</a>. I’m calling them properties, but ‘guarantees’ would be a better term: you don’t automatically get them; you actually need to design for these guarantees because your clients expect you to follow the contract. Let’s get the definitions out of the way first and then we’ll look at the contract and why it’s important to stick to it.</p>
<!--more-->
<h2 id="idempotent-http-methods">Idempotent HTTP Methods</h2>
<p>An operation is <strong>idempotent</strong> if it produces the same result whether it is executed once or multiple times. For example, it doesn’t matter how many times I submit a request to set my current location to ‘San Francisco’. The final outcome will be the same: the city field in the database is set to ‘San Francisco’. On the other hand, a request to POST a new message to the forum is <em>not idempotent</em>: the same message will be stored or sent multiple times if the client repeats the request. Some people wrongly assume that for a request to be idempotent, the same response must be sent back to the client each time: idempotency has nothing to do with the response that’s sent back to the client. It’s a server-side guarantee ensuring that the <strong>state of the resource on the server</strong> does not change any further after the first request, no matter how many times the request is duplicated.</p>
<p>Some idempotent operations have an additional, <em>special property</em>: they do not modify the state on the server side at all. Simply put, these methods are read-only and have absolutely zero side-effects. For example, a query to retrieve my current city doesn’t change the database. These types of operations are given a <em>special</em> name: <strong>safe</strong> or <strong><a href="http://www.less-broken.com/blog/2011/07/why-you-should-care-about-idempotence.html">nullipotent methods</a></strong>:</p>
<blockquote>
<p>Related is the idea of nullipotence: a function is nullipotent if not calling it at all has the same side effects as calling it once or more. In practice, this simply means that the function doesn’t have any side effects at all. A database query saying “get row 42” is a good example. Nullipotence is clearly a stronger condition than idempotence.</p>
</blockquote>
<p>Here’s a <strong>list</strong> of the most commonly used HTTP methods and whether they are idempotent and/or safe as defined by the contract:</p>
<table>
<thead>
<tr>
<th>HTTP Method</th>
<th>Idempotent?</th>
<th>Safe?</th>
</tr>
</thead>
<tbody>
<tr>
<td>GET</td>
<td>yes</td>
<td>yes</td>
</tr>
<tr>
<td>HEAD</td>
<td>yes</td>
<td>yes</td>
</tr>
<tr>
<td>PUT</td>
<td>yes</td>
<td>NO</td>
</tr>
<tr>
<td>DELETE</td>
<td>yes</td>
<td>NO</td>
</tr>
<tr>
<td>POST</td>
<td>NO</td>
<td>NO</td>
</tr>
</tbody>
</table>
<h2 id="is-patch-idempotent">Is PATCH idempotent?</h2>
<p>PATCH is neither idempotent nor safe. It is used for partial updates to resources. It is a common source of confusion because it is <em>possible</em> to implement PATCH in an idempotent way, for example by using ETags or the If-Modified-Since header.</p>
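<p>As a sketch of how a precondition header can make a retried PATCH harmless (a simplified, hypothetical server; real ETag handling has more moving parts):</p>

```python
# Illustrative sketch: an If-Match style precondition stops a stale
# retry from being applied twice. Names and versioning are made up.
resource = {"gpa": 3.5}
etag = "v1"  # assumption: server bumps this on every write

def handle_patch(if_match, changes):
    global etag
    if if_match != etag:
        return 412  # Precondition Failed: resource changed since client read it
    resource.update(changes)
    etag = "v2"
    return 200

first = handle_patch("v1", {"gpa": 3.85})  # applied
retry = handle_patch("v1", {"gpa": 3.85})  # stale ETag: rejected, no double-apply
```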
<h2 id="why-idempotency-matters">Why idempotency matters</h2>
<p>Idempotency and safety (nullipotency) are <em>guarantees</em> that server applications make to their clients and the world. It is a <a href="https://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html">contract defined by the HTTP</a> standard that developers must adhere to when implementing <strong>REST API</strong>s over HTTP. An operation doesn’t automatically become idempotent or safe just because it is invoked using the <code class="language-plaintext highlighter-rouge">GET</code> method; it must also be implemented in an idempotent manner. A poorly written server application might use <code class="language-plaintext highlighter-rouge">GET</code> methods to update a record in the database or to send a message to a friend (I have seen applications that do this). This is a really, really bad design.</p>
<p>Adhering to the idempotency and safety <strong>contract</strong> helps make an API fault-tolerant and robust. Clients, middleware applications and the various servers that requests pass through before reaching your application use this contract for optimizations. Clients may automatically cancel a <code class="language-plaintext highlighter-rouge">GET</code> request that is taking too long and repeat it, because they assume it has the same effect (since <code class="language-plaintext highlighter-rouge">GET</code> is idempotent). However, they won’t do the same for <code class="language-plaintext highlighter-rouge">POST</code> requests because the first one may have already altered some state on the server side. This is the reason why web browsers display a warning that you are about to re-submit a form when you hit the back button to return to one (for this reason, always redirect after a successful <code class="language-plaintext highlighter-rouge">POST</code> operation).
In the same vein, cache servers <a href="http://stackoverflow.com/questions/626057/is-it-possible-to-cache-post-methods-in-http">don’t cache</a> POST requests, and safe methods can be pre-fetched and stored in cache to enhance performance.</p>
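<p>A client or intermediary retry policy built on this contract might be sketched as (a hypothetical helper, not a real library API):</p>

```python
# Sketch of the retry policy intermediaries can apply (hypothetical helper).
# Per the HTTP contract, these methods may be repeated on timeout.
IDEMPOTENT = {"GET", "HEAD", "PUT", "DELETE"}

def max_attempts(method):
    # Safe to repeat idempotent requests; never replay POST blindly,
    # since the first attempt may already have changed server state.
    return 3 if method.upper() in IDEMPOTENT else 1
```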
<p>In summary, when building RESTful applications using HTTP, it is important to implement HTTP methods in a manner that satisfies their idempotency and safety contract, because clients and intermediaries are free to use this contract to optimize and enhance the user experience. Don’t use <code class="language-plaintext highlighter-rouge">GET</code> method for operations that alter the database and don’t use <code class="language-plaintext highlighter-rouge">POST</code> to retrieve information (<a href="https://blogs.dropbox.com/developers/2015/03/limitations-of-the-get-method-in-http/">with one exception</a>).</p>
<p>See you next time.</p>
<h1><a href="https://codeahoy.com/2016/06/26/behind-monty-halls-closed-doors-our-limited-minds">Behind Monty Hall's Closed Doors - Our Limited Minds</a></h1>
<p><em>2016-06-26</em></p>
<p>There’s a classic brain-teaser in the field of probability that goes like this:</p>
<blockquote>
<p>Imagine that you’re on a television game show and the host presents you with three closed doors. Behind one of them, sits a sparkling, brand-new Lincoln Continental; behind the other two, are smelly old goats. The host implores you to pick a door, and you select door <em>1</em>. Then, the host, who is well-aware of what’s going on behind the scenes, opens door <em>3</em>, revealing one of the goats.</p>
<p>“Now,” he says, turning toward you, “do you want to keep door <em>1</em>, or do you want to switch to door <em>2</em>?”</p>
</blockquote>
<p><img src="https://codeahoy.com/img/blogs/montyhall.jpg" alt="Would you switch" /></p>
<!--more-->
<p><strong>Would you switch when given the choice or stick to your guns</strong>? If you haven’t heard this puzzle before, think about your answer carefully. It might just surprise you.</p>
<p>Go on, think about it. Continue reading only after you have thought of an answer.</p>
<p>Most people who attempt this problem reach the conclusion that the odds don’t change and it doesn’t matter whether you switch or stick to your original choice. <strong>The correct answer, however, is that you have a much better chance of winning if you switch</strong>. Sal Khan explains the math behind the better odds of winning by switching in this <a href="https://www.khanacademy.org/math/precalculus/prob-comb/dependent-events-precalc/v/monty-hall-problem">video</a>. You can even try it for yourself <a href="http://math.ucsd.edu/~crypto/cgi-bin/MontyKnows/monty2?2+611">here</a>.</p>
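<p>The claim is also easy to check with a short simulation. Here is a minimal Monte Carlo sketch (function names and trial count are my own, purely illustrative):</p>

```python
import random

def simulate(trials=100_000, switch=True, seed=42):
    """Monte Carlo estimate of the win rate for the Monty Hall game."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        car = rng.randrange(3)
        pick = rng.randrange(3)
        # Host opens a door that is neither the pick nor the car.
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            # Switch to the one remaining closed door.
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials

stay = simulate(switch=False)  # converges to about 1/3
swap = simulate(switch=True)   # converges to about 2/3
```

Running it shows the switch strategy winning roughly twice as often as staying.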
<p>This problem is known as the <a href="https://en.wikipedia.org/wiki/Monty_Hall_problem">Monty Hall Problem</a>. It is named after the host of the television game show “Let’s Make a Deal”. What’s interesting about this puzzle is the <strong>controversy it caused in the academic world</strong>. It started when <a href="https://en.wikipedia.org/wiki/Marilyn_vos_Savant">Marilyn vos Savant</a>, listed in The Guinness Book of Records for the highest recorded IQ, <a href="http://marilynvossavant.com/game-show-problem/">answered the puzzle</a> in her column in response to a question from one of her readers. She wrote:</p>
<blockquote>
<p><strong>Yes; you should switch</strong>. The first door has a <em>1/3</em> chance of winning, but the second door has a <em>2/3</em> chance. Here’s a good way to visualize what happened. Suppose there are a million doors, and you pick door <em>1</em>. Then the host, who knows what’s behind the doors and will always avoid the one with the prize, opens them all except door <em>777,777</em>. You’d switch to that door pretty fast, wouldn’t you?</p>
</blockquote>
<p>Her response provoked outrage from thousands of academics who insisted that she is wrong - the odds do not change:</p>
<blockquote>
<p>I’m receiving <strong>thousands of letters, nearly all insisting that I’m wrong</strong>, including the Deputy Director of the Center for Defense Information and a Research Mathematical Statistician from the National Institutes of Health! Of the letters from the general public, <strong>92% are against my answer, and of the letters from universities, 65% are against my answer. Overall, nine out of ten readers completely disagree with my reply</strong>.</p>
</blockquote>
<p>It’s just mind boggling that the puzzle evaded so many talented mathematicians and scientists who vehemently criticized Marilyn. Here’s what one PhD wrote:</p>
<blockquote>
<p>You blew it, and you blew it big!</p>
<p>Since you seem to have difficulty grasping the basic principle at work here, I’ll explain. After the host reveals a goat, you now have a one-in-two chance of being correct. Whether you change your selection or not, the odds are the same. There is enough mathematical illiteracy in this country, and we don’t need the world’s highest IQ propagating more. Shame!</p>
<p><em>Scott Smith, Ph.D., University of Florida</em></p>
</blockquote>
<p>From a true statistical standpoint, Marilyn is 100% correct. So why did so many experts have trouble wrapping their heads around this problem? <a href="http://www.nytimes.com/1991/07/21/us/behind-monty-hall-s-doors-puzzle-debate-and-answer.html">According</a> to Stanford professor Persi Diaconis:</p>
<blockquote>
<p>“Our brains are just not wired to do probability problems very well, so I’m not surprised there were mistakes,”</p>
</blockquote>
<p>So there’s a psychological side to the puzzle and, in reality, <strong>most contestants did not accept the switch</strong> and <a href="http://www.nytimes.com/1991/07/21/us/behind-monty-hall-s-doors-puzzle-debate-and-answer.html">stuck to their initial choice</a>:</p>
<blockquote>
<p>Mr. Hall said he was not surprised at the experts’ insistence that the probability was 1 out of 2. “That’s the same assumption contestants would make on the show after I showed them there was nothing behind one door,” he said. “<strong>They’d think the odds on their door had now gone up to 1 in 2, so they hated to give up the door no matter how much money I offered</strong>. By opening that door we were applying pressure. We called it the Henry James treatment. It was ‘The Turn of the Screw.’”</p>
</blockquote>
<p><a href="http://priceonomics.com/the-time-everyone-corrected-the-worlds-smartest/j">Apparently</a> for contestants and experts alike, the Monty Hall Problem causes <a href="https://en.wikipedia.org/wiki/Cognitive_dissonance">cognitive dissonance</a> which is the mental stress when an individual holds two or more contradictory beliefs or is presented with new information that conflicts with their existing beliefs:</p>
<blockquote>
<p>When people are confronted with evidence that is “inconsistent with their beliefs” (ie. the odds of winning by switching doors being ⅔, instead of ½), <strong>they first respond by refuting the information, then band together with like-minded dissenters and champion their own hard-set opinion</strong>. This is precisely the mentality of vos Savant’s thousands of naysayers.</p>
</blockquote>
<p>The way I see it: after one door is eliminated, most people <em>believe</em> that they have a 50/50 chance of winning a brand new car or a smelly goat since there are only two doors left - the distribution is uniform. They believe it so strongly that even after hearing correct explanations - that the distribution isn’t uniform - they refuse to accept them because they conflict with their initial, more strongly held belief. They conveniently ignore the new information. <strong>The Monty Hall Problem is so counter-intuitive to our beliefs that <a href="https://en.wikipedia.org/wiki/Paul_Erd%C5%91s">Paul Erdos</a>, one of the most brilliant mathematicians of the last century, rejected Marilyn’s correct solution and remained unconvinced until he was shown a computer simulation</strong>. According to <a href="https://blog.codinghorror.com/monty-hall-monty-fall-monty-crawl/">Jeff Atwood</a>, Paul Erdos finally realized his own limits:</p>
<blockquote>
<p>Paul Erdos was brilliant, but even he realized his own limits when presented with the highly unintuitive Monty Hall problem. For his epitaph, he suggested, in his native Hungarian, “Végre nem butulok tovább”. This translates into English as “<strong>I’ve finally stopped getting dumber</strong>.”</p>
<p>If only the rest of us could be so lucky.</p>
</blockquote>
<h1><a href="https://codeahoy.com/2016/06/25/serverless-architectures--lets-ditch-the-servers">What is Serverless Architecture? AWS Lambda Features (2020)</a></h1>
<p><em>2016-06-25</em></p>
<p>Serverless is the new buzzword that is <a href="http://martinfowler.com/articles/serverless.html">quickly</a> <a href="http://highscalability.com/blog/2015/12/7/the-serverless-start-up-down-with-servers.html">gaining</a> <a href="https://www.manning.com/books/serverless-architectures-on-aws" rel="nofollow">momentum</a> and attention.</p>
<p>The concept is to be able to run server-side code without worrying about the messy details of provisioning and setting up servers, disk drives and other resources. You write code, upload it and — voilà! — it starts running. All the complications of managing the infrastructure, provisioning servers, auto-scaling, installing languages and frameworks are eliminated and hidden away by the cloud provider (AWS, Azure, Google Cloud). The cloud provider takes care of allocating and managing the resources, invoking the code in response to a request, providing it the context and input information it needs to do its job and return the result to the client. By focusing less time on managing scaling and availability, software developers are increasingly using serverless architectures for more advanced workloads.</p>
<!--more-->
<h2 id="functions-as-a-service">Functions as a Service</h2>
<p>There is no clear view or consensus on what serverless is; for many people, it means writing your code as <strong>functions</strong> and giving them to cloud providers for execution. This is referred to as <em>Functions as a Service</em> or <em>FaaS</em>. This view of serverless is the main focus of this article.</p>
<p>All major cloud vendors provide FaaS:</p>
<ul>
<li><a href="https://aws.amazon.com/lambda/">AWS Lambda</a> on AWS. The most popular implementation of FaaS.</li>
<li><a href="https://azure.microsoft.com/en-us/services/functions/" rel="nofollow">Azure Functions</a> on Microsoft Azure</li>
<li><a href="https://cloud.google.com/functions/" rel="nofollow">Cloud Functions</a> on Google Cloud</li>
</ul>
<p>Going serverless requires a different approach to application design. In <strong>serverless</strong> architectures, the backend is broken down into thin, stand-alone, event-driven functions, each performing a single task in response to a user action or event. The business logic shifts from the backend to the client, e.g. a mobile app, which becomes the main orchestrator, calling various functions to perform actions for the user when needed. For example, running in a serverless architecture, a photo-sharing app like Instagram might call one function to upload an image to the server, followed by another call to a function that reads the followers’ information from a database and notifies them.</p>
<p>Serverless architectures require smart clients that know about and talk to a wide range of remote functions. While mobile app developers have had rich frameworks and platforms that allowed them to build complex logic on the client easily, things weren’t so simple for web applications. But thanks to rich client-side application frameworks like React and Angular, and <a href="http://codeahoy.com/2016/04/23/what-is-http2/">fast HTTP/2 protocol</a>, it is now possible to build complex applications seamlessly into the browser. This will help drive the serverless trend even further.</p>
<h2 id="amazon-lambda---features-pros-and-cons">Amazon Lambda - Features, Pros and Cons</h2>
<p>Amazon Web Services (<em>AWS</em>), the undisputed leader of cloud computing, launched a product called <a href="https://aws.amazon.com/lambda/">Lambda</a> for serverless applications back in 2014.</p>
<blockquote>
<p>AWS Lambda lets you run code without provisioning or managing servers. You pay only for the compute time you consume - there is <strong>no charge</strong> when your code is not running. With Lambda, you can run code for virtually any type of application or backend service - all <strong>with zero administration</strong>. Just upload your code and Lambda takes care of everything required to run and scale your code with high availability. You can set up your code to automatically trigger from other AWS services or call it directly from any web or mobile app.</p>
</blockquote>
<p>I gave it a try and converted one of our small microservices into a Lambda function.</p>
<p>Setting up a Lambda function in AWS was straightforward. The only challenge I faced was connecting the function to another AWS product, the <a href="https://aws.amazon.com/api-gateway/" rel="nofollow">API Gateway</a>, to expose it as a <a href="https://en.wikipedia.org/wiki/Representational_state_transfer">REST</a> end-point. Here are the <strong>pros and cons</strong> I discovered during the process.</p>
<h4 id="1-lambda-programming-languages">1. Lambda Programming Languages</h4>
<p>AWS Lambda supports not all, but most major languages: Java 8, Python, Go, PowerShell, Ruby, C# and Node.js. In addition, there is a Runtime API which allows you to use any programming language for your functions.</p>
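<p>To illustrate the programming model, here is a minimal Python handler in the shape the Lambda runtime invokes; the <code class="language-plaintext highlighter-rouge">name</code> field in the event is a made-up example, not a fixed schema:</p>

```python
import json

def lambda_handler(event, context):
    """Minimal Python handler sketch. The (event, context) signature
    follows the Lambda convention; the event payload here is illustrative."""
    name = (event or {}).get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }

# Local invocation for testing; in AWS, the runtime supplies event and context.
response = lambda_handler({"name": "Lambda"}, None)
```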
<p>Another thing I liked about Lambda is that there are many <strong>templates</strong> to choose from when creating a new Lambda function, which can come in very handy. They contain examples of how to access various databases and integrate with other AWS products and services.</p>
<h4 id="2-lambda-latency--cold-starts">2. Lambda Latency & Cold Starts</h4>
<p>Up until recently, Lambda startup times were an issue for latency-sensitive applications. ‘Cold functions’ are those that haven’t run in some time. When a new request or event triggers a cold Lambda function, the cloud provider needs to find an environment (server) to load the function and its related resources into, and then run it. This usually took 50 to 500 milliseconds. Once a Lambda function is loaded onto a server, it stays there for some time, roughly 30 minutes or so. If a new request comes in during that window, it is executed immediately because the function is already loaded. This might not sound like an issue, but for latency-sensitive applications dealing with thousands of simultaneous requests and irregular traffic patterns, warming up functions to keep them ready was a problem.</p>
<p>AWS recently introduced <strong><a href="https://aws.amazon.com/blogs/aws/new-provisioned-concurrency-for-lambda-functions/">Provisioned Capacity</a></strong> that fixes this issue by allowing developers to specify the number of warm instances they want to keep ready always to handle incoming requests. This is an excellent feature that gives developers greater control on ensuring low latencies and faster response times. Provisioned capacity isn’t free so please do check the pricing so you understand the costs before you use it.</p>
<h4 id="3-lambda-execution-duration-limit">3. Lambda Execution Duration Limit</h4>
<p>The execution duration of Lambda functions has an upper limit, currently 15 minutes. This is more than sufficient for many use cases but could be an issue for batch-type applications or long-running tasks like converting videos. I don’t see this as a huge issue.</p>
<h4 id="4-pricing">4. Pricing</h4>
<p>The <a href="https://aws.amazon.com/lambda/pricing/">pricing</a> is based on the number of requests <em>and</em> the duration of script’s execution, billed in 100 millisecond increments. So if a Lambda function runs for 15 milliseconds, it will be billed for 100. This could be an issue for very high-volume applications with lots of short-running functions. A crude hack to get the best bang for the buck would be to combine short-running Lambda operations into a larger one. Also, if you want to expose your Lambda methods as REST end-points using AWS API Gateway, you’d incur extra costs as the API Gateway has <a href="https://aws.amazon.com/api-gateway/pricing/" rel="nofollow">separate pricing</a>.</p>
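<p>The rounding described above can be sketched in a couple of lines (billing increments only; the actual per-request and per-GB-second rates are omitted here):</p>

```python
import math

def billed_ms(actual_ms):
    # Duration is rounded up to the next 100 ms increment, per the
    # billing model described in this post.
    return math.ceil(actual_ms / 100) * 100

# A 15 ms invocation bills the same as a 95 ms one: both round up to 100 ms,
# which is why combining many short-running functions can save money.
```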
<h4 id="5-lambda-is-stateless">5. Lambda is Stateless</h4>
<p>Lambda functions are stateless and asynchronous. Each function invocation has no idea about the state of previous invocations and its output or state isn’t automatically available to subsequent functions. You can still access external data by calling other services such as S3 or ElastiCache.</p>
<p>It would be wonderful to share a few things, like connection pools, that are expensive to set up. Connection pooling isn’t <a href="https://forums.aws.amazon.com/thread.jspa?threadID=216000" rel="nofollow">properly supported</a>. Setting up and tearing down database connections for each request increases latency and affects performance. There are workarounds, though, like using <a href="https://aws.amazon.com/rds/proxy/" rel="nofollow">Amazon RDS Proxy</a> to maintain connection pools.</p>
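<p>One common workaround, sketched below with placeholder names, is to create expensive objects at module level so that warm invocations of the same container reuse them instead of rebuilding them each time:</p>

```python
# Sketch: module-level (container-level) reuse across warm invocations.
# make_connection is a stand-in for expensive setup, e.g. a database client.
_connection = None

def make_connection():
    return object()  # placeholder for a real client object

def get_connection():
    # Initialized once per warm container, reused by later invocations;
    # a cold start pays the setup cost again.
    global _connection
    if _connection is None:
        _connection = make_connection()
    return _connection

def handler(event, context):
    conn = get_connection()
    return conn is get_connection()  # same object while the container is warm
```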
<h4 id="6-lambda-debugging-and-logging">6. Lambda Debugging and Logging</h4>
<p>Debugging and logging isn’t easy and has a learning curve. When testing my Lambda function, I spent a lot of time scrolling through <a href="https://aws.amazon.com/cloudwatch/" rel="nofollow">CloudWatch</a> entries to find issues.</p>
<h4 id="7-lambda-cicd">7. Lambda CI/CD</h4>
<p>Lambda functions can be set up for automatic deployments through CI/CD pipelines. You could host your function code on GitHub, set up a new pipeline using AWS CodePipeline and then use AWS CodeBuild to build and deploy the function.</p>
<h2 id="future-of-serverless-vs-server-should-we-ditch-the-servers">Future of serverless vs server. Should we ditch the servers?</h2>
<p>While the AWS Lambda and serverless haven’t yet broken the threshold, <strong>the future looks promising</strong>. Going serverless requires a shift in thinking, re-inventing tooling and setting up the right processes for version control, deployments, cost monitoring and control, monitoring, testing, security, etc.</p>
<p>No, developers are not going to ditch the servers and move everything to serverless. Instead, developers are adopting serverless for certain use cases where it fits well, while continuing to use servers, containers and microservices.</p>
<p>Serverless is a great concept. Infrastructure management is challenging, can be <em>very</em> painful, and requires a dedicated team. It shifts the focus away from the real problem to the undifferentiated heavy lifting of managing servers, auto-scaling groups and instance tagging, and even worse, to building infrastructure-specific logic, like health checks, into applications. Development time increases because developers now carry the burden of managing their infrastructure. That’s the biggest beef I have with DevOps: it forces skilled developers to spend their time and energy worrying about infrastructure intricacies instead of building useful applications that solve real problems. Serverless architectures take some of these barriers away and reduce the friction, allowing developers to get started quickly.</p>
<p><em>Updated: March 1, 2020</em></p>
<h1><a href="https://codeahoy.com/2016/06/20/blameless-postmortems---examining-failure-without-blame">Blameless Postmortems - Examining Failure Without Blame</a></h1>
<p><em>2016-06-20</em></p>
<p>Let’s face it: failure is inevitable in complex systems. It cares not for the number of tests you ran, code reviews or your monitoring tools. It just happens. And how is failure usually dealt with? Instead of learning from it to improve the resilience of the system, <strong>the traditional view is to assign blame and point fingers at individuals responsible for the failure</strong>. It’s easier to identify a culprit than to find the real cause. In <a href="https://www.amazon.com/Field-Guide-Understanding-Human-Error/dp/0754648265">The Field Guide to Understanding Human Error</a>, author <a href="http://sidneydekker.com/">Sidney Dekker</a> refers to this as the “old view” that leads us nowhere:</p>
<!--more-->
<blockquote>
<p>When faced with a human error problem, you may be tempted to ask ‘<strong>Why didn’t they watch out better? How could they not have noticed?</strong>’. You think you can solve your human error problem by telling people to be more careful, by reprimanding the miscreants, by issuing a new rule or procedure. These are all expressions of <strong>‘The Bad Apple Theory’, where you believe your system is basically safe if it were not for those few unreliable people in it</strong>. This old view of human error is increasingly outdated and will lead you nowhere. <strong>The new view, in contrast, understands that a human error problem is actually an organizational problem</strong>.</p>
</blockquote>
<p>When employees are blamed and shamed by their superiors, who have the power of hindsight on their side, a few things happen:</p>
<ul>
<li>Employees become defensive and lose motivation. <strong>The overall team sociology and culture suffers</strong>.</li>
<li>Employees start hiding mistakes. The team and the company doesn’t learn any lessons and nothing is done to prevent failures from happening again.</li>
<li>No one actually takes the responsibility and everybody blames each other.</li>
</ul>
<p>So how should companies handle mistakes?</p>
<p>When failure occurs, the role of the management should be to figure out what happened so they can improve something to prevent it from happening again. <strong>But the management doesn’t have a crystal ball that can give out all the details</strong>. They have to rely on their employees for this information. In order for employees to come forward and admit responsibility for their mistakes, the right type of culture and environment must be present. One that doesn’t punish people for their mistakes. It requires a <strong>“Just Culture”</strong>. The CTO of Etsy, John Allspaw, <a href="https://codeascraft.com/2012/05/22/blameless-postmortems/">describes it</a>:</p>
<blockquote>
<p>Having a <strong>Just Culture</strong> means that you’re making effort to balance safety and accountability. It means that <strong>by investigating mistakes in a way that focuses on the situational aspects of a failure’s mechanism and the decision-making process of individuals</strong> proximate to the failure, an organization can come out safer than it would normally be if it had simply punished the actors involved as a remediation.</p>
<p><strong>Having a “blameless” Post-Mortem process</strong> means that engineers whose actions have contributed to an accident can give a detailed account of:</p>
<ul>
<li>what actions they took at what time,</li>
<li>what effects they observed,</li>
<li>expectations they had,</li>
<li>assumptions they had made,</li>
<li>and their understanding of timeline of events as they occurred.</li>
</ul>
<p>…and that they can give this detailed <strong>account without fear of punishment or retribution</strong>.</p>
</blockquote>
<p>For management, it starts with hiring good people and assuming that everyone involved in the failure had good intentions; they didn’t think that the mistake was possible. If you have incompetent turkeys on your team, no amount of processes or metrics will save the project. After you have hired good people, trust them to make the right decisions. <strong>Managers who constantly question the competence of their teams can’t build productive teams</strong>.</p>
<p>When failures occur, the goal should be to understand ‘<em>what</em>’ caused the failure without focusing on the ‘<em>who</em>’. There are various techniques to get to the <a href="https://en.wikipedia.org/wiki/Root_cause">root cause</a>. I have been using a technique known as the ‘5 Whys’ which works really well in most situations. It was developed by <a href="https://en.wikipedia.org/wiki/Sakichi_Toyoda">Sakichi Toyoda</a> and its goal is to:</p>
<blockquote>
<p>determine the root cause of a problem by <strong>repeating the question “Why?” Each question forms the basis of the next question</strong>.</p>
</blockquote>
<p>Here’s an example of 5 Whys in practice from <a href="http://www.joelonsoftware.com/items/2008/01/22.html">Joel’s blog</a>:</p>
<blockquote>
<p>[Problem:] <strong>Our link to Peer1 NY went down</strong></p>
<ul>
<li><em>Why?</em> – Our switch appears to have put the port in a failed state</li>
<li><em>Why?</em> – After some discussion with the Peer1 NOC, we speculate that it was quite possibly caused by an Ethernet speed / duplex mismatch</li>
<li><em>Why?</em> – The switch interface was set to auto-negotiate instead of being manually configured</li>
<li><em>Why?</em> – We were fully aware of problems like this, and have been for many years. But - we do not have a written standard and verification process for production switch configurations.</li>
<li><em>Why?</em> – Documentation is often thought of as an aid for when the sysadmin isn’t around or for other members of the operations team, whereas, it should really be thought of as a checklist.</li>
</ul>
</blockquote>
<p>For more comprehensive and formal analysis that goes beyond ‘first stories’, <a href="https://codeascraft.com/2012/05/22/blameless-postmortems/">John Allspaw</a> recommends collecting “second stories” from multiple sources and perspectives as part of the formal postmortem process:</p>
<blockquote>
<p>From <a href="https://www.amazon.com/Behind-Human-Error-David-Woods/dp/0754678342">Behind Human Error</a> here’s the difference between “first” and “second” stories of human error:</p>
</blockquote>
<table>
<thead>
<tr>
<th>First Stories</th>
<th>Second Stories</th>
</tr>
</thead>
<tbody>
<tr>
<td>Human error is seen as cause of failure</td>
<td>Human error is seen as the effect of systemic vulnerabilities deeper inside the organization</td>
</tr>
<tr>
<td>Saying what people should have done is a satisfying way to describe failure</td>
<td>Saying what people should have done doesn’t explain why it made sense for them to do what they did</td>
</tr>
<tr>
<td>Telling people to be more careful will make the problem go away</td>
<td>Only by constantly seeking out its vulnerabilities can organizations enhance safety</td>
</tr>
</tbody>
</table>
<p>A <em>Just Culture and blameless postmortems</em> aren’t about avoiding accountability. In fact, they achieve the opposite effect by <strong>creating a culture</strong> where people can freely admit their mistakes and learn from them. <strong>A mistake can be a great opportunity to learn a valuable lesson</strong>. When failure happens, a thorough analysis is performed, with an emphasis on <strong>process over people, to make it better</strong>. These analyses or ‘postmortems’ should be performed <em>after</em> the problem has been solved, when emotions aren’t running high. The results or findings must be shared with the entire team or the company.</p>
<p>It’ll be helpful to keep in mind that <strong>most people have a natural tendency to assign blame to others</strong>. Some do it more explicitly than others. As managers, when you are trying to create a <em>Just Culture</em>, you need to keep your eyes and ears open and when <strong>you detect finger pointing, you must tactfully shift the focus away from people and back to the process</strong>.</p>
<h1>Continuous Delivery - Automating the Release Process</h1>
<p><em>2016-06-18 · <a href="https://codeahoy.com/2016/06/18/continuous-delivery---automating-the-release-process">codeahoy.com</a></em></p>
<p>For many software developers, <strong>release days are stressful events</strong>. There’s always some risk that things might go wrong in the process or that a bug would surface in production. At my <a href="http://starscriber.com/">previous company</a>, we had a manual release process that was very human-intensive and, hence, error-prone. On release days, DevOps would load binaries on a staging environment and perform user acceptance tests (UAT) manually. If the tests were successful, the software was copied to production servers and verified with smoke tests and occasionally, a trimmed down version of the UAT was run again. Here are the common problems we faced:</p>
<!--more-->
<ul>
<li>2 out of 3 times when we had to roll back a release because of an issue, it was due to a <strong>configuration mismatch</strong> between staging and production environments.</li>
<li>The release process was <strong>slow</strong> and it took a long time to put new features in the hands of our users. It wasn’t uncommon for the release process to take days or sometimes even weeks.</li>
<li>The slow release process and the manual UAT had another side-effect: <strong>the developers didn’t get timely feedback</strong>. By the time feedback arrived, they were often in the middle of another feature. This incurred additional overhead because their memories were no longer fresh and on rare occasions, the error got re-introduced due to mix-up between branches.</li>
</ul>
<p>In short, manual and ad-hoc release processes are suboptimal and release day is fraught with stress. In our case, it was tolerable until releases became more frequent and the team grew. To improve and automate the release process, there’s a software engineering approach known as Continuous Delivery (CD).</p>
<p><strong>Continuous delivery makes it possible to release new features quickly and reliably. It provides fast feedback to developers</strong>. The software is built in a way where it <em>can</em> be automatically and safely released to production at any time. This is ensured by <strong>delivering every change to a production-like environment and running extensive automated tests on it</strong>. According to <a href="http://martinfowler.com/bliki/ContinuousDelivery.html">Martin Fowler</a>, you are doing continuous delivery if:</p>
<blockquote>
<p>[…] a business sponsor could request that <strong>the current development version of the software can be deployed into production at a moment’s notice</strong> - and nobody would bat an eyelid, let alone panic.</p>
</blockquote>
<p><strong>Continuous delivery has a pre-requisite known as Continuous Integration (CI)</strong>. CI is an <a href="http://www.extremeprogramming.org/">Extreme Programming</a> practice which requires that the new changes are integrated into the main branch as soon as they are finished so that <strong>the project is always kept in a working state</strong>. Usually, it works like this: developers commit their changes to GitHub (or another VCS) which triggers the build process. The entire application along with the required dependencies is built and <strong>a comprehensive set of unit and integration tests are run against it</strong>. If the tests fail, the team stops working and fixes the issue until the software is in working state again. Without CI, integration can very easily become a nightmare. Continuous integration is great and it’s one of the very first things I do when starting a new project.</p>
<p>I have seen many examples where <strong>teams stop paying attention to broken builds</strong>. This usually happens when the CI process becomes a big, hairy beast. This defeats the main goal of CI: <strong>broken builds should never be ignored and fixing them should be the top priority for the team</strong>. To ensure that it happens, the CI process should be kept short, sweet and simple. If tests take too long to run, are unreliable or can’t help pin-point the problem, the team will quickly stop paying attention to the broken builds or even worse, finger-pointing like “<em>the other project broke the build</em>” will ensue in dysfunctional teams.</p>
<p>Continuous integration is mainly focused on development teams. <strong>It’s possible to have CI with a manual release process</strong>. In our case, we had CI but the binaries and corresponding configuration files were manually copied to staging and production environments. In contrast, <strong>continuous delivery automates the release process end-to-end</strong>. To achieve that goal, it uses a pipeline with distinct stages and their associated processes.</p>
<p>A <strong>continuous delivery pipeline</strong> is a manifestation of your release process for getting a new release out of the door. According to <a href="http://martinfowler.com/bliki/DeploymentPipeline.html">Martin Fowler</a>:</p>
<blockquote>
<p>One of the challenges of an automated build and test environment is you want your build to be fast, so that you can get fast feedback, but comprehensive tests take a long time to run. <strong>A deployment pipeline is a way to deal with this by breaking up your build into stages. Each stage provides increasing confidence, usually at the cost of extra time. Early stages can find most problems yielding faster feedback, while later stages provide slower and more thorough probing</strong>. Deployment pipelines are a central part of ContinuousDelivery.</p>
</blockquote>
<p>A typical CD pipeline might look like the following:</p>
<p><img src="https://codeahoy.com/img/blogs/cd_pipeline.png" alt="Continuous Delivery Pipeline" /></p>
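<p>The staged approach can be sketched in a few lines of code. This is a minimal illustration rather than a real CD tool; the <code>Stage</code> type and the stage names are invented for the example:</p>

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch of a staged deployment pipeline: each stage is a named
// check, and the pipeline stops at the first failing stage, so fast, cheap
// stages give quick feedback before the slower, more thorough ones run.
public class Pipeline {
    record Stage(String name, Supplier<Boolean> check) {}

    // Runs stages in order; returns the name of the first failing stage,
    // or "SUCCESS" if every stage passed.
    static String run(List<Stage> stages) {
        for (Stage s : stages) {
            if (!s.check().get()) {
                return s.name();
            }
        }
        return "SUCCESS";
    }

    public static void main(String[] args) {
        List<Stage> stages = List.of(
            new Stage("commit (compile + unit tests)", () -> true),
            new Stage("automated acceptance tests",    () -> true),
            new Stage("UAT / staging",                 () -> false), // simulated failure
            new Stage("production deploy",             () -> true)
        );
        System.out.println(run(stages)); // prints "UAT / staging"
    }
}
```

<p>The point of the sketch is only the ordering: a failure in an early, cheap stage is reported before any expensive stage runs at all.</p>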
<p>A crucial part that determines the success of CD pipelines is well-written <strong><a href="http://www.extremeprogramming.org/rules/functionaltests.html">acceptance tests</a></strong> that are run in the later stages of the pipeline for “<em>more thorough probing</em>”. They ensure that the software satisfies user requirements and specifications. Acceptance tests should not be exposed to internal system details and should treat the system like a <strong>black box</strong>. Our acceptance tests consisted of input that a real user would provide, and they checked the system’s output to verify that it matched expectations.</p>
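<p>In that black-box spirit, an acceptance test only drives the system through its public entry point and compares outputs. The <code>Greeter</code> “system” and its <code>handle</code> method below are made up for illustration; a real test would exercise a deployed service the same way:</p>

```java
// Hypothetical black-box acceptance test: the test only knows the system's
// public entry point and the expected output; it never inspects internal
// state. The Greeter class is a stand-in for a real deployed system.
public class AcceptanceTestSketch {
    // Stand-in for the deployed system's public API.
    static class Greeter {
        String handle(String input) {
            return input.isBlank() ? "ERROR: empty request" : "Hello, " + input + "!";
        }
    }

    // Returns true when the observed output matches the expectation.
    static boolean acceptance(String input, String expected) {
        return new Greeter().handle(input).equals(expected);
    }

    public static void main(String[] args) {
        System.out.println(acceptance("Alice", "Hello, Alice!"));   // prints "true"
        System.out.println(acceptance("", "ERROR: empty request")); // prints "true"
    }
}
```

<p>Because the test never touches <code>Greeter</code>’s internals, the implementation behind <code>handle</code> can change freely without breaking the test.</p>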
<p><strong>Transition from one stage to the next in the CD pipeline could be automatic or manual</strong>. Manual doesn’t imply copying artifacts manually to the next stage. It means that someone needs to signal the system that it’s okay to <strong>transition to the next stage usually at a push of a button</strong>.</p>
<p>Continuous delivery <strong>pipelines are modeled after delivery processes</strong>. There’s no right way: one pipeline may look very different from another. For example, in an <a href="https://en.wikipedia.org/wiki/Service-oriented_architecture">SOA</a> project with many stand-alone components, we decided that a single pipeline for all the components was the best solution. Another project required individual pipelines for each component (microservice), feeding into the <strong>integration pipeline</strong> shown in the picture below.</p>
<p><img src="https://codeahoy.com/img/blogs/integration_pipeline.png" alt="Integration Pipeline" /></p>
<p>Implementing a good continuous delivery pipeline can be a <em>daunting task</em> but yields great benefits if done right. In my opinion, the best way is to carefully study your deployment process, understand all the dependencies, get buy-in from the team and start with something small and simple.</p>
<h2 id="continuous-delivery-vs-continuous-deployment">Continuous Delivery vs Continuous Deployment</h2>
<p>In continuous <strong>delivery</strong>, a human makes the final call to deploy the release to production environment. Typically, the release to production happens several changes later or at some fixed date.</p>
<p>Continuous <strong>deployment</strong> takes the continuous <strong>delivery</strong> process one step further: every change that passes the automated tests is <strong>deployed to production automatically</strong>. Continuous deployment may not be right for every project; it sounds great in theory, and I’m sure it is, but I haven’t tried it in a commercial project.</p>
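<p>The distinction can be captured in a tiny sketch: both models run the same automated tests, and the only difference is whether the final production step needs a human approval gate. The method and its parameters are invented for illustration:</p>

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of delivery vs. deployment: identical automated
// checks, and the only difference is the human approval gate before the
// production step.
public class ReleaseGate {
    // Returns "DEPLOYED" when tests pass and either the pipeline is fully
    // automatic (continuous deployment) or a human approved the release
    // (continuous delivery); "WAITING" while awaiting approval; "BLOCKED"
    // when the automated tests failed.
    static String release(boolean testsPassed, boolean autoDeploy, BooleanSupplier humanApproval) {
        if (!testsPassed) return "BLOCKED";
        if (autoDeploy || humanApproval.getAsBoolean()) return "DEPLOYED";
        return "WAITING";
    }

    public static void main(String[] args) {
        // Continuous deployment: no human in the loop.
        System.out.println(release(true, true, () -> false));  // prints "DEPLOYED"
        // Continuous delivery: the release waits for someone to push the button.
        System.out.println(release(true, false, () -> false)); // prints "WAITING"
    }
}
```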
<p>Here’s a nice picture comparing the continuous delivery and deployment processes taken from <a href="http://blog.crisp.se/2013/02/05/yassalsundman/continuous-delivery-vs-continuous-deployment">Yassal Sundman’s blog</a>:</p>
<p><img src="https://codeahoy.com/img/blogs/continuous-delivery-deployment-sm.jpg" alt="Continuous Delivery vs Continuous Deployment" /></p>
<p>As far as continuous delivery tools are concerned, I don’t have a personal preference. I’ve recently started using <a href="https://aws.amazon.com/codepipeline/">AWS CodePipeline</a> (along with <a href="https://aws.amazon.com/codedeploy/">AWS CodeDeploy</a>) to automate the delivery process on the AWS cloud and I’m generally satisfied with it.</p>
<h1>The Law of Demeter - Writing Shy Code</h1>
<p><em>2016-06-17 · <a href="https://codeahoy.com/2016/06/17/the-law-of-demeter---writing-shy-code">codeahoy.com</a></em></p>
<p>In all my years of building server-side applications, I have come to believe that the single most important aspect that determines the long-term success of these projects isn’t the speed of algorithms or the fancy frameworks. <em>Nope</em>. It’s the complexity of the code. <strong>Unmanaged complexity has a profound effect on the maintainability of large projects</strong>. Applications that are difficult to understand aren’t amenable to refactoring. Introducing <strong>new features becomes a slow and painful process, increasing the crucial time to market</strong>. I’ve seen complicated systems where developers were petrified of making even small changes for fear that they might inadvertently break some other part. Or the spaghetti code that is only understood by a single individual or a handful of developers who get a free pass on anything because the project will be doomed if they quit.</p>
<!--more-->
<p>There are multiple factors that contribute to code complexity. <strong>One important factor is the amount of <a href="https://en.wikipedia.org/wiki/Coupling_(computer_programming)">coupling</a> or interdependencies between the application’s modules</strong>. Let’s walk through a trivial example. Suppose there’s a server that allows users to connect. When users connect and authenticate, they are wrapped in a ‘User’ class:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Represents a user</span>
<span class="kd">class</span> <span class="nc">User</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kd">final</span> <span class="nc">String</span> <span class="n">username</span><span class="o">;</span>
<span class="kd">public</span> <span class="kd">final</span> <span class="kt">int</span> <span class="n">id</span><span class="o">;</span>
<span class="kd">public</span> <span class="kd">final</span> <span class="nc">Socket</span> <span class="n">socket</span><span class="o">;</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">disconnect</span><span class="o">()</span> <span class="o">{</span>
<span class="n">socket</span><span class="o">.</span><span class="na">close</span><span class="o">();</span>
<span class="o">}</span>
<span class="c1">// Other methods that operate on the socket.</span>
<span class="o">}</span>
</code></pre></div></div>
<p>If we want to send a message to a user, we would do something like this:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Another class</span>
<span class="kd">class</span> <span class="nc">Message</span> <span class="o">{</span>
<span class="c1">// Send "Hello." string to a User</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">sayHello</span><span class="o">(</span><span class="nc">User</span> <span class="n">user</span><span class="o">)</span> <span class="o">{</span>
<span class="nc">Socket</span> <span class="n">s</span> <span class="o">=</span> <span class="n">user</span><span class="o">.</span><span class="na">socket</span><span class="o">;</span> <span class="c1">// Get the user's socket</span>
<span class="nc">OutputStream</span> <span class="n">outputStream</span> <span class="o">=</span> <span class="n">s</span><span class="o">.</span><span class="na">getOutputStream</span><span class="o">();</span>
<span class="nc">PrintWriter</span> <span class="n">out</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">PrintWriter</span><span class="o">(</span><span class="n">outputStream</span><span class="o">);</span>
<span class="n">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">"Hello."</span><span class="o">);</span> <span class="c1">// send the message to the socket.</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>The <em>User</em> class has an obvious flaw: it failed to encapsulate the <em>socket</em> object and leaked it to the world. If we want to change this implementation in the future (e.g. use an asynchronous socket library), we’ll have to make changes in many places.</p>
<p>However, the <em>sayHello(…)</em> method of the <em>Message</em> class isn’t entirely innocent. <strong>It sinned in the manner in which it interacted with the <em>socket</em> object</strong>. It <strong>obtained access to an independent “third-party” object</strong> (<em>socket</em>) and used it directly. The example might be contrived, but I have seen this pattern far too many times in “real-world” applications. The good news is that <strong>this can be easily detected using a technique with a fancy name: <a href="http://www.ccs.neu.edu/research/demeter/papers/law-of-demeter/oopsla88-law-of-demeter.pdf">The Law of Demeter</a></strong>. The “law” (the term itself is a misnomer; it’s really a technique or a guideline) can be <a href="https://en.wikipedia.org/wiki/Law_of_Demeter">summarized</a> as:</p>
<blockquote>
<ul>
<li>Each unit should have only <strong>limited knowledge about other units</strong>: only units “closely” related to the current unit.</li>
<li>Each unit should only talk to its friends; <strong>don’t talk to strangers</strong>.</li>
<li><strong>Only talk to your immediate friends</strong>.</li>
</ul>
<p>The fundamental notion is that <strong>a given object should assume as little as possible about the structure or properties of anything else</strong> (including its subcomponents), in accordance with the principle of “information hiding”.</p>
</blockquote>
<p>In short, tight coupling between logically independent modules violates the Law of Demeter. Although it is a side effect of poor encapsulation, <strong>the law discourages direct access and use of third-party objects</strong>. Here’s a good <a href="http://pmd.github.io/pmd-5.1.3/rules/java/coupling.html">example</a> to illustrate the Law of Demeter:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="kd">class</span> <span class="nc">Foo</span> <span class="o">{</span>
<span class="cm">/**
* This example will result in two violations.
*/</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">example</span><span class="o">(</span><span class="nc">Bar</span> <span class="n">b</span><span class="o">)</span> <span class="o">{</span>
<span class="c1">// this method call is ok, as b is a parameter of "example"</span>
<span class="no">C</span> <span class="n">c</span> <span class="o">=</span> <span class="n">b</span><span class="o">.</span><span class="na">getC</span><span class="o">();</span>
<span class="c1">// this method call is a violation, as we are using c, which we got from B.</span>
<span class="c1">// We should ask b directly instead, e.g. "b.doItOnC();"</span>
<span class="n">c</span><span class="o">.</span><span class="na">doIt</span><span class="o">();</span>
<span class="c1">// this is also a violation, just expressed differently as a method chain without temporary variables.</span>
<span class="n">b</span><span class="o">.</span><span class="na">getC</span><span class="o">().</span><span class="na">doIt</span><span class="o">();</span>
<span class="c1">// a constructor call, not a method call.</span>
<span class="no">D</span> <span class="n">d</span> <span class="o">=</span> <span class="k">new</span> <span class="no">D</span><span class="o">();</span>
<span class="c1">// this method call is ok, because we have create the new instance of D locally.</span>
<span class="n">d</span><span class="o">.</span><span class="na">doSomethingElse</span><span class="o">();</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>Now scroll back up and look at the <code class="language-plaintext highlighter-rouge">sayHello(...)</code> method. It violates the Law of Demeter by talking to a stranger: the <em>socket</em> object. Even though <strong>the method itself is pretty much helpless, it helps us detect tight-coupling</strong>. The problem starts with the <em>User</em> class that failed to hide its internal details. So let’s fix it:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">class</span> <span class="nc">User</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kd">final</span> <span class="nc">String</span> <span class="n">username</span><span class="o">;</span>
<span class="kd">public</span> <span class="kd">final</span> <span class="kt">int</span> <span class="n">id</span><span class="o">;</span>
<span class="c1">// Make the field private to hide it from the world.</span>
<span class="kd">private</span> <span class="kd">final</span> <span class="nc">Socket</span> <span class="n">socket</span><span class="o">;</span>
<span class="c1">// Simplified implementation to encapsulate messaging functionality</span>
<span class="kt">void</span> <span class="nf">sendMessage</span><span class="o">(</span><span class="nc">String</span> <span class="n">message</span><span class="o">)</span> <span class="o">{</span>
<span class="nc">OutputStream</span> <span class="n">outputStream</span> <span class="o">=</span> <span class="n">socket</span><span class="o">.</span><span class="na">getOutputStream</span><span class="o">();</span>
<span class="nc">PrintWriter</span> <span class="n">out</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">PrintWriter</span><span class="o">(</span><span class="n">outputStream</span><span class="o">);</span>
<span class="n">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="n">message</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>Now we can fix the <code class="language-plaintext highlighter-rouge">sayHello(...)</code> method to stop relying on the <em>socket</em> object:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">class</span> <span class="nc">Message</span> <span class="o">{</span>
<span class="c1">// Doesn't violate the Law of Demeter anymore.</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">sayHello</span><span class="o">(</span><span class="nc">User</span> <span class="n">user</span><span class="o">)</span> <span class="o">{</span>
<span class="n">user</span><span class="o">.</span><span class="na">sendMessage</span><span class="o">(</span><span class="s">"Hello."</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>(Coupling) Problem solved. In their book <a href="https://www.amazon.com/Pragmatic-Programmer-Journeyman-Master/dp/020161622X">The Pragmatic Programmer</a>, <a href="https://twitter.com/pragmaticandy">Andrew</a> and Dave suggest writing <strong>“shy” code that doesn’t interact with too many things</strong>.</p>
<p>I keep an eye out for Law of Demeter violations when writing or reviewing code and <strong>refactor code to reduce unnecessary coupling where it makes sense</strong>. This could be automated with <a href="http://pmd.github.io/">source code analyzers</a>, though I haven’t used them myself, so take that suggestion with a grain of salt. I’ll update this post if I do.</p>
<h1>Remote Software Development - Lessons Learned</h1>
<p><em>2016-06-12 · <a href="https://codeahoy.com/2016/06/12/remote-software-development---lessons-learned">codeahoy.com</a></em></p>
<p>Previously, I talked about the ill-fated <a href="http://www.codeahoy.com/2016/04/21/when-to-rewrite-from-scratch-autopsy-of-a-failed-software/">rewrite of our core product</a>. The ‘<a href="https://en.wikipedia.org/wiki/Second-system_effect">second system</a>’, although better and faster than its predecessor, was <em>rejected</em> by the customer. But the story has a silver lining - one thing that we got right in the process that empowered us to grow and thrive: <strong>we built a highly productive remote development team</strong>.</p>
<p>In this post, I’ll share my experiences: how through trial and error (and with luck), we finally got the right formula and made it work for us.</p>
<p><!--more--></p>
<p>The original system team included one of the best <a href="http://www.joelonsoftware.com/items/2009/09/23.html">duct-tape developers</a> I’ve ever worked with. We called him ‘<em>The Code Machine</em>’. He didn’t write clean code, but he wrote it fast as hell and made sure it worked. The original system, done with a gun to our heads, was a big, complex, monolithic application and the <strong>boundaries between its modules were ill-defined</strong>. Computer scientists refer to this type of code as <a href="https://en.wikipedia.org/wiki/Coupling_(computer_programming)">tightly-coupled</a> and the system is referred to as <a href="https://en.wikipedia.org/wiki/Orthogonality_(programming)">non-orthogonal</a>.</p>
<p>The ‘second system’ had a grander vision and we needed to grow the team to build it right and on schedule. We hired another developer, a bright, young CS graduate, to work on the system. One problem with the original system quickly became evident to me: because the code was tightly-coupled (the auto-generated UML class diagrams resembled a spider’s cobweb), the <strong>new developer couldn’t work independently: changing one part of the system indirectly impacted other parts</strong>. While they were not bickering, the two developers couldn’t get out of each other’s way and the progress was slow. Parallel to all this, <strong>the upper management decided to give ‘remote software development’ a try</strong>. Someone on the board recommended a company from Hyderabad, India and before I even knew it, I was interviewing a bunch of remote Java developers.</p>
<p>The non-orthogonality problem was further magnified when the remote developer joined the team. Because the system was tightly-coupled, he was going to have to learn it all before he could become productive. Somehow, we isolated a few ‘modules’ so that he could start writing features without spending months learning the entire system. But his changes frequently conflicted with the work of the local team and <strong>integration became a nightmare, requiring everyone’s involvement</strong>. This issue aside, I had other concerns as well. It appeared that the remote developer was just ‘covering his ass’ and leaving a paper trail. I hate bureaucracy and don’t work that way. I later learned, that the offshore company had a ‘different’ culture than the one I was trying to build.</p>
<p>Let me stop here to summarize the situation: we had a <strong>complex system</strong> with huge overlap of module responsibilities. Changes rippled through the whole system and affected the entire team. We had a <strong>remote developer in India</strong>, outside of our comfort (time)zone, and integration often forced everyone to work odd hours. To make matters worse, the remote developer had <strong>low morale</strong> and the offshore company we were contracting with was rife with dysfunctional politics.</p>
<p>It was then that I decided to slow ourselves down and think about how we could avoid all this in the new design. Since we didn’t have a dedicated Q/A, I assigned that responsibility temporarily to the new developer so he could be productive (it was a dick move, but he was allowed time to learn and practice his coding skills and he ended up getting <a href="http://education.oracle.com/pls/web_prod-plq-dad/db_pages.getpage?page_id=320">SCJP</a> during that period). The remote developer got re-assigned to a new R&amp;D project that the CEO was cooking up.</p>
<p>When the things calmed down, we realized that three things must happen if we want to scale our development:</p>
<ol>
<li>The new system must be orthogonal. It should be modular and modules should communicate with each other only using well-defined messages, the public API. I wanted the modules to be 100% blackbox so they could be developed and tested independently… <a href="http://codeahoy.com/2016/05/06/good-abstractions-have-fewer-leaks/">abstractions that do not leak</a>.</li>
<li>Each module must be assigned to a single developer or a team who will 100% own it. I wanted to get around <a href="https://en.wikipedia.org/wiki/Brooks%E2%80%99_law">Brooks’s Law</a> by minimizing communication between local and remote developers as much as possible.</li>
<li>We must do something about the culture and fix the working conditions for our remote developers. We needed the right people who would take full ownership.</li>
</ol>
<h2 id="1-building-orthogonal-system">1. Building Orthogonal System</h2>
<p>In two-dimensional geometry, two lines are orthogonal if they intersect each other at right angles. This affords a kind of <strong>independence</strong>, where you could move along one of the lines without affecting your position projected onto the other line. This simply means that in an orthogonal system, you could easily change the ‘networking layer’ without affecting the ‘database layer’ or the ‘UI layer’. In <a href="https://www.amazon.com/Pragmatic-Programmer-Journeyman-Master/dp/020161622X">Pragmatic Programmer</a>, Andrew and David used an excellent analogy to describe a non-orthogonal system:</p>
<blockquote>
<p>You’re on a helicopter tour of the Grand Canyon when the pilot, who made the
obvious mistake of eating fish for lunch, suddenly groans and faints. Fortunately,
he left you hovering 100 feet above the ground. You rationalize that the collective
pitch lever controls overall lift, so lowering it slightly will start a gentle descent
to the ground. However, when you try it, you discover that life isn’t that simple.
The helicopter’s nose drops, and you start to spiral down to the left. Suddenly you
discover that you’re flying a system where every control input has secondary
effects. Lower the left-hand lever and you need to add compensating backward
movement to the right-hand stick and push the right pedal. But then each of
these changes affects all of the other controls again. Suddenly you’re juggling an
unbelievably complex system, where every change impacts all the other inputs.
Your workload is phenomenal: your hands and feet are constantly moving, trying
to balance all the interacting forces.</p>
<p>Helicopter controls are decidedly not orthogonal.</p>
</blockquote>
<p>We took the layered approach to system design similar to the <a href="https://en.wikipedia.org/wiki/OSI_model">OSI model</a>. The layers we built had following qualities:</p>
<ul>
<li>Each layer had a single major responsibility.</li>
<li>Layers exposed a well-defined API. Communication with other layers was done only using the API.</li>
<li>Layers (with the exception of business logic) communicated with only one other layer.</li>
</ul>
<p>Here’s what the design looked like:</p>
<p><img src="https://codeahoy.com/img/blogs/orthogonal_system.png" alt="orthogonal system" /></p>
<p>In addition to the layers, there were several smaller subsystems not shown in the diagram above such as the ‘Admin UI’ tool and simulators for testing the system end-to-end.</p>
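<p>One way to read those layer rules in code: each layer is an interface, and a layer implementation holds a reference only to the one layer directly below it, never reaching further down. The interface and method names here are invented for illustration; they are not the actual APIs from the project:</p>

```java
// Hypothetical sketch of the layering rules: each layer exposes a small,
// well-defined API and talks only to the single layer directly below it.
// Interface and method names are made up for this example.
public class LayersSketch {
    interface NetworkLayer { String send(String payload); }
    interface ProtocolLayer { String request(String message); }

    // The protocol layer depends only on the network layer's API; it has no
    // idea how bytes actually move, so the transport can be swapped freely.
    static class SimpleProtocol implements ProtocolLayer {
        private final NetworkLayer network;
        SimpleProtocol(NetworkLayer network) { this.network = network; }
        public String request(String message) {
            return network.send("FRAME[" + message + "]");
        }
    }

    public static void main(String[] args) {
        // A fake loopback transport; handy for testing the layer in isolation.
        NetworkLayer loopback = payload -> "ACK:" + payload;
        ProtocolLayer protocol = new SimpleProtocol(loopback);
        System.out.println(protocol.request("ping")); // prints "ACK:FRAME[ping]"
    }
}
```

<p>Because each layer only sees the interface below it, a layer (and its owner) can change an implementation without rippling into the rest of the system.</p>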
<h2 id="2-circumventing-brooks-law">2. Circumventing Brook’s Law</h2>
<p>Brooks’s Law sucks. It basically says that the communication overhead increases as the number of people involved in a project increases. Communication with a remote team that is outside +/- 5 hours of your timezone becomes a burden. We had a full 12-hour time difference. So <strong>we decided to extend the principle of single responsibility to developers</strong> and decided that each layer shall have a single owner or a small sub-team located in the same timezone. Initially, I was a little hesitant about silos that might affect teamwork and create <a href="https://en.wikipedia.org/wiki/Local_optimum">local optimums</a>, but we had a great team and it never became an issue. The upper management was concerned about ‘<strong>individual owners getting hit by a bus</strong>’ and wanted some level of redundancy. To achieve that, we decided that we’ll keep the system as simple as possible, perform regular code reviews and document it well. Our approach to documentation was simple:</p>
<ul>
<li>For every layer or module, I came up with a <strong>high-level product document</strong>. The goal was to describe the <em>what and why</em> (not how) in simple and concise terms. The document had <strong>a vision that described what the module would do in one sentence and why we were building it, a 10,000 foot overview, core requirements and a list of the things that the module shall not do</strong>.</li>
<li>The developers created design documents. They were a brief description of how the project was organized into sub-packages, followed by <a href="https://en.wikipedia.org/wiki/Interaction_overview_diagram">interaction diagrams</a>. On average, this document was mostly 3-6 diagrams with very little text.</li>
</ul>
<p>The documentation was kept in the GitHub wiki along with the source code.</p>
<p>So we had divided the system in a manner that individual pieces could be assigned independently to remote teams and the instructions were clear enough to minimize communication. We wanted to assign ownership and get out of their way. But there was a big problem: our remote developer wasn’t motivated enough to take ownership. Growing the team over there was out of the question since we had no control over the environment or the culture.</p>
<h2 id="3-getting-the-right-people-on-the-bus">3. Getting the Right People On the Bus</h2>
<p>Just as the remote developer was finally starting to become somewhat productive, I had concerns about his morale. I reached the conclusion that things could not be improved and we terminated the contract.</p>
<p>I was so disheartened at the state of remote software development that I was ready to write it off, but a fellow CTO at a different company told me that <strong>they’re getting great results on oDesk</strong>. I decided to experiment a little and asked for a $1500/month budget, which was approved. I found an independent consultant and contracted him on a part-time basis to work on our web admin tool. <strong>The guy was a winner</strong>. Whenever I assigned him a new task, he asked questions until he fully understood it. He would then go away and send me regular updates on his progress. His releases were on time and his code was pristine. He didn’t program in Java, but luckily had (equally smart) friends who did. It’s a long story, but he helped us grow the team: we hired four full-time developers and he eventually set up his own office to give them a space to work from.</p>
<p>I visited the remote team once or twice a year. These visits helped me understand the issues these guys faced and the face-to-face whiteboard meetings enabled them to understand more complex designs easily.</p>
<h2 id="summary">Summary</h2>
<p>I purposely told the story of how my company built a remote software development team instead of giving recommendations up front in the hope that you would draw your own conclusions. In retrospect, we went through trial and error and luckily found the right people who made it work. But despite all of that, it was a strenuous journey. If the remote developers <strong>require constant babysitting and you cannot measure their productivity, it’s probably not working</strong>. If you can afford it, <strong>steer clear of remote developers working more than +/- 3 hours outside of your time zone</strong>. If you decide to take the offshore route, here are the key takeaways from our experience:</p>
<ul>
<li><strong>Divide the system into independent layers and minimize the overlap between the layers</strong>. Keep the number of layers that have to communicate with one another to a minimum.</li>
<li>Go to extreme lengths to hire the right people; avoid offshore sweatshops or hired guns. Understand that it will require extra time and effort to make them productive. When you find the right people, <strong>treat them with the same respect as you would extend to local developers, make them feel like they are part of the family, and make sure they are being compensated fairly</strong>. Don’t make them stay up late nights to attend meetings. For example, instead of making four guys stay up late to talk to me, I would schedule the weekly meeting during my evening, their morning.</li>
<li><strong>Provide clear requirements</strong>. I cannot stress this enough. Be as concise and as crystal clear as possible. If the requirements aren’t clear, the remote developers might get blocked waiting for answers or even worse, deviate from the requirements and build something entirely different.</li>
</ul>
IDEs and Productivity2016-06-10T00:00:00+00:00https://codeahoy.com/2016/06/10/ides-and-developer-productivity<p>I used to be neutral on the choice and even the use of an <a href="https://en.wikipedia.org/wiki/Integrated_development_environment">IDE</a> for writing code. In university, I learned and used <a href="https://en.wikipedia.org/wiki/Vim_(text_editor)">Vim</a> for assignments. When I started my first job, I switched to a <a href="https://netbeans.org/">popular Java IDE</a> because everyone at work was using it and it featured a nice debugger. Other than debugging and basic auto-completion, I didn’t explore many of the IDE’s features. But that changed one day when I ran into an issue compiling a Java project. I went over to the desk of the developer who had made the most recent commit. He opened up the project in his IDE and in <strong>just a few keystrokes, he had the entire <a href="https://blog.jetbrains.com/idea/2010/05/maven-dependencies-diagram/">maven dependency graph</a> displayed on his screen. Just like magic</strong>. A few more keystrokes and he was able to locate the root cause. I was <strong>surprised</strong> because after months of using the IDE, I had no idea it had that feature. Two things happened:</p>
<!--more-->
<ul>
<li>I realized that up until that point, I was using the IDE like a text editor and missing out on many of its powerful features.</li>
<li>I became convinced that <strong>knowing how to effectively use an IDE has a profound impact on developer productivity</strong>. This is especially true for complex enterprise applications, where superior knowledge of IDE can provide a significant boost in productivity.</li>
</ul>
<p>I started out with Vim and loved it. When I first tried Eclipse/NetBeans for Java development, I was overwhelmed by the number of things that <strong>cluttered</strong> the screen.</p>
<p><img src="https://codeahoy.com/img/blogs/eclipse-clutter.jpg" alt="Eclipse Clutter" /></p>
<p>The second thing that put me off was that I had to use the mouse a lot for navigation. But I saw the value in refactoring, code completion and generating boilerplate code, so I decided to bite the bullet and switch to it from Vim. In retrospect, it was a good decision.</p>
<p>Many developers who become proficient in text editors like Vim or <a href="https://en.wikipedia.org/wiki/Emacs">Emacs</a> don’t switch to an IDE. I worked with a developer who wrote code exclusively in Emacs. When I hired him, he told me that he’d use nothing but Emacs. I was sure that in time I’d be able to ‘show him the light’ and that he’d see the value in using an IDE. But after watching him write code in Emacs, I realized that he was obviously an expert in Emacs and <strong>had more to lose than gain by starting all over again in an IDE</strong>. I didn’t make any attempts to convince him to give up Emacs for an IDE. (On the contrary, he got me thinking that I should perhaps ditch my IDE and switch to Emacs.)</p>
<p>So if you are already using Vim or Emacs and are productive in it, I won’t try to convince you to switch to an IDE. But if you are sitting on the fence and don’t have a preference, <strong>I’d strongly recommend picking up a good IDE and learning it really well</strong>. The choice of IDE will depend on the language you are programming in. I have been using <strong><a href="https://www.jetbrains.com/idea/">IntelliJ IDEA</a> for Java development and highly recommend it</strong>. It’s extremely powerful and I feel like I can’t live without the <em>alt-enter</em> shortcut anymore. It <a href="http://martinfowler.com/bliki/PostIntelliJ.html">impressed Martin Fowler</a> after developers in his company voluntarily switched to it:</p>
<blockquote>
<p>The biggest endorsement of IntelliJ came from ThoughtWorks developers. <strong>If anyone suggested a standard IDE for ThoughtWorks projects we needed tear-gas to control the riots</strong>. There were JBuilder zealots, textpad zealots, slickedit zealots - don’t even get me started on the emacs zealots.</p>
<p><strong>Within six months nearly everyone was using IntelliJ. Voluntarily and eagerly</strong>.</p>
</blockquote>
<p>But don’t just use IntelliJ like any other IDE. <strong>Make an effort to learn and understand its features to enhance the development process</strong>. I’d also suggest learning keyboard shortcuts and <strong>relying on the mouse less and less until you don’t need to use it at all when writing code</strong>. I haven’t completely got there myself, but it is possible to do 100% mouse-free development in IntelliJ. Watch this presentation by <a href="https://blog.jetbrains.com/idea/author/hhariri/">Hadi Hariri</a> in which he demonstrates how to ditch the mouse and use IntelliJ only with the keyboard: <a href="https://www.youtube.com/watch?v=eq3KiAH4IBI">IntelliJ IDEA Tips and Tricks Full Version</a>.</p>
<h2 id="notes">Notes:</h2>
<ul>
<li><a href="https://blog.jetbrains.com/idea/2016/03/enjoying-java-and-being-more-productive-with-intellij-idea/">Enjoying Java and Being More Productive with IntelliJ IDEA</a></li>
<li>For Python development, I use <a href="https://www.jetbrains.com/pycharm/">PyCharm</a> which is developed by the same guys as IntelliJ IDEA.</li>
<li>I replaced Sublime Text with <a href="https://atom.io/">Atom</a> by GitHub on my machines.</li>
<li>Mark Seemann makes some strong points <a href="http://blog.ploeh.dk/2013/02/04/BewareofProductivityTools/">against the use of productivity tools</a> (including IDEs). You be the judge.</li>
</ul>
Write Less Code2016-06-03T00:00:00+00:00https://codeahoy.com/2016/06/03/write-less-code<p>Not too long ago, I sat down to ‘clean up’ a project that I inherited. I was given the reins of the refactoring efforts because the project has had several bugs in production. It was stuck in a vicious cycle where fixing old bugs would introduce new ones. So I dived into the source code one weekend and the problem soon became evident: <strong>the project was a big, hairy mess</strong>. I use the word <em>big</em> because there was lots of unnecessary, redundant and tightly coupled code. By <em>hairy mess</em>, I don’t mean that the code looked amateur or was full of shortcuts. In fact, the problem was quite the opposite. <strong>There was too much magic</strong> and everywhere I looked, I saw clever and grandiose design practices that had no relationship with the actual problem that the project was built to solve. Things like reflection, aspect oriented programming, custom annotations were all present. The project was an over-engineered beast. To put it into perspective, after the refactoring was over, the module was reduced to less than half of its original size.</p>
<!--more-->
<p>I’m sure the developers who wrote the project did so with the best intentions, but their clever tricks turned against them. They spent a lot of time on periodic maintenance and fixing bugs. The clients were unhappy that the software was full of bugs. The developers felt like shit because everyone was always complaining about the project. But who’s to blame for their misery, for the long hours they had to work to fix the bugs and get no satisfaction out of their jobs? <em>No one else to blame other than the developers themselves</em>. One of my favorite bloggers, Jeff Atwood, wrote that the <a href="https://blog.codinghorror.com/the-best-code-is-no-code-at-all/" rel="nofollow">best code is no code at all</a>:</p>
<blockquote>
<p>It’s painful for most software developers to acknowledge this, because they love code so much, <strong>but the best code is no code at all</strong>. Every new line of code you willingly bring into the world is code that has to be debugged, code that has to be read and understood, code that has to be supported. Every time you write new code, you should do so reluctantly, under duress, because you completely exhausted all your other options. <strong>Code is only our enemy</strong> because there are so many of us programmers writing so damn much of it. If you can’t get away with no code, the next best thing is to <strong>start with brevity</strong>.</p>
</blockquote>
<p>Jeff’s point is undeniable. As developers, we have an itch to come up with clever solutions that we think will make us look professional or help us learn a new tool or technology. We build complex layers to solve simple problems and justify them as being “actually necessary”. But we must realize that <em>the more code we write, the more magic we apply, the more opportunities and doors we leave open for the bugs to creep in</em>. These bugs will come back and haunt us or our successors in the form of overtime required to fix them. I’m obviously not talking about using slick tricks to reduce the number of lines of code. Rather, we should ask ourselves whether we need to write all that code to solve the actual problem. I’ve seen a couple of custom ORMs and <strong>handmade thread pools</strong> in my career, which brings me to another point:</p>
<blockquote>
<p>Don’t reinvent the wheel. Pretty please.</p>
</blockquote>
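<p>For the thread-pool case in particular, the JDK has shipped a ready-made one in <code class="language-plaintext highlighter-rouge">java.util.concurrent</code> since Java 5; a minimal sketch:</p>

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PoolExample {
    static int square(int n) {
        return n * n;
    }

    public static void main(String[] args) throws Exception {
        // Worker threads, a task queue, graceful shutdown -- all of the
        // machinery a handmade pool would reimplement, in three lines.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        Future<Integer> result = pool.submit(() -> square(7));
        System.out.println(result.get()); // prints 49
        pool.shutdown();
    }
}
```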
<p>But don’t just stop there. Think about whether that fancy framework is needed at all. A project I worked on used <em>Hibernate</em>, along with the complementary <em>DAO</em>’s and <em>DTO</em>’s, to execute <strong>one</strong> simple, straightforward query. Another project had a comprehensive event handling system for a filter that used the <a href="https://docs.oracle.com/javase/tutorial/reflect/" rel="nofollow">reflection API</a> to find and invoke the handler class based on the event type. It was an “ingenious” solution and it took me a while to figure out that the unused methods marked by the IDE were actually invoked using reflection. The icing on the cake: the system handled just one type of event. About five classes’ worth of code could have been condensed into a simple <code class="language-plaintext highlighter-rouge">if</code> statement:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="o">(</span><span class="n">event</span><span class="o">.</span><span class="na">type</span> <span class="o">==</span> <span class="no">THE_ONE_TYPE_THIS_SYSTEM_CAN_HANDLE</span><span class="o">)</span> <span class="o">{</span>
<span class="n">process_the_event_and_return_result</span><span class="o">;</span>
<span class="o">}</span>
</code></pre></div></div>
<p>The best code is no code at all and the fastest code is the code that never gets executed. Our goal should be to keep our solutions as simple as possible and stay away from our natural tendencies to over-engineer, use clever tricks and design patterns until it can be proven that they are absolutely necessary to solve the problem. <strong>Complexity</strong> is our worst <strong>enemy</strong>. Unnecessary complexity, even more so because most of the time, <strong><a href="https://codeahoy.com/2017/08/19/yagni-cargo-cult-and-overengineering-the-planes-wont-land-just-because-you-built-a-runway-in-your-backyard/">you aren’t gonna need it</a></strong>.</p>
<p>I’m going to end this post with an excellent piece of advice from Jeff:</p>
<blockquote>
<p>If you love writing code– really, truly love to write code– you’ll love it enough to write as little of it as possible.</p>
</blockquote>
<p>Please share this article if you liked it. It will help the site grow and every share means a lot. Social sharing buttons are below. Thank you!</p>
Why Do Developers Love Music so Much? How to Concentrate in Noisy Open Offices2016-05-28T00:00:00+00:00https://codeahoy.com/2016/05/28/why-do-developers-love-music-so-much<blockquote>
<p>There are a million ways to lose a work day, but not even a single way to get one back - <em>Peopleware</em></p>
</blockquote>
<p>I used to like open floor plans until I had to work in one. On my first day of work at a company with an <strong>open office</strong> plan, I walked through a sea of conjoined desks to my spot and noticed on the way that over three quarters of the staff in the section had their headphones on. I thought to myself that these guys must love their music. It wasn’t until I settled in and tried to read my emails that I noticed how loud the office was. It then struck me that the poor souls were making a hopeless attempt to <strong>block out the noise with their headphones so they could concentrate on their tasks</strong>. Not long after, I was reaching into my laptop bag for my own headphones while searching for some music on YouTube in a vain attempt to get in the zone.</p>
<!--more-->
<p>At my very first job, which was at a startup, I had a semi-private office that I shared with a senior developer. The second office could fit two developers or be used as a meeting area when it was available. The outside area was big enough for 4, perhaps 5 developers. The Operations and the I.T. staff sat in a different section that was separated from the engineering offices by a corridor and a couple of doors. The engineering offices were generally <strong>very quiet</strong> and there were hardly any distractions for developers.</p>
<p>The office layout looked like this:</p>
<p><img src="https://codeahoy.com/img/blogs/starscriber-offices.png" alt="Starscriber engineering offices" class="center-image" /></p>
<p>It was perfect (except that the windows didn’t have much of a view). We had just the right balance of space and the number of people using it.</p>
<p>Getting back to open floor plans, I think that the biggest problem is not so much the concept as the density of people crammed into a small space. With so many people, there will be many visual and audible distractions like phone calls, conversations, Skype calls, FedEx delivery guys walking around, etc. This makes it impossible for developers to focus, especially when they are trying to concentrate on a hard problem. Symptoms of such an environment include developers looking for quiet corners in the office or ‘working from home’ in order to finish their projects on time.</p>
<p>I’m not saying that listening to music is bad - I listen to it myself when I’m doing some trivial task like writing a report or when I’m trying to relax. But I do have an issue when developers <strong>must use music in order to concentrate on their work</strong>. In the <a href="http://www.amazon.com/exec/obidos/ASIN/0932633439/">Peopleware</a> chapter “Bring Back the Door”, there’s a section on an interesting study conducted at the Cornell University on the effects of listening to music while performing programming tasks:</p>
<blockquote>
<p>During the 1960s, researchers at Cornell University conducted a series of tests on the effects of working with music. They polled a group of computer science students and divided the students into two groups, those who liked to have music in the background while they worked (studied) and those who did not. Then they put half of each group together in a silent room, and the other half of each group in a different room equipped with earphones and a musical selection. Participants in both rooms were given a Fortran programming problem to work out from specification. To no one’s surprise, participants in the two rooms performed about the same in speed and accuracy of programming […]</p>
<p>The Cornell experiment, however, contained a <strong>hidden wild card</strong>. The specification required that an output data stream be formed through a series of manipulations on numbers in the input data stream. For example, participants had to shift each number two digits to the left and then divide by one hundred and so on, perhaps
completing a dozen operations in total. Although the specification never said it, the net effect of all the operations was that each output number was necessarily equal to its input number. Some people realized this and others did not. Of those who figured it out, the overwhelming majority came from the quiet room.</p>
<p>Many of the everyday tasks performed by professional workers are done in the serial processing center of the left brain. Music will not interfere particularly with this work, since it’s the brain’s holistic right side that digests music. But not all of the work is centered in the left brain. There is that occasional breakthrough that makes you say “Ahah!” and steers you toward an ingenious bypass that may save months or years of work. The creative leap involves right-brain function. <strong>If the right brain, is busy listening to 1001 Strings on Muzak, the opportunity for a creative leap is lost</strong>.</p>
</blockquote>
<p>Great points.</p>
<p>Taking action to change the working environment isn’t easy. Most employees and managers who understand that their work environment is counter-productive don’t act because they feel that any sort of change is beyond their control or that any attempt to do so would be useless. Unfortunately, this is often true. It costs money, and some leases even forbid any sort of restructuring or changes to the office layout. But it’s 100% worth it. I have won a few battles on this front and lost many. It took <em>over 2 years</em> of convincing and a couple of resignations for the stakeholders of a software development firm I helped start to move out of a very noisy ‘shared office’ into a quiet one with lots of private and semi-private offices. This “Policy of Default” is illustrated in Peopleware:</p>
<blockquote>
<p>A California company that I consult for is very much concerned about being responsive to its people. Last
year, the company’s management conducted a survey in which all programmers (more than a thousand) were asked to list the best and the worst aspects of their jobs. The manager who ran the survey was very excited
about the changes the company had undertaken. He told me that the number two problem was poor communication
with upper management. Having learned that from the survey, the company set up quality circles, gripe sessions, and other communication programs. I listened politely as he described them in detail. When he was done, I asked <strong>what the number one problem was. “The environment,” he replied. “People were upset about the noise.” I asked what steps the company had taken to remedy that problem. “Oh, we couldn’t do anything about that,” he said.
“That’s outside our control</strong>.”</p>
</blockquote>
<p>Except for noise and interruptions, I have no major beef with open office floor plans. But those two things are inherent properties of big open floor plans. Noise makes it difficult to concentrate, and interruptions force developers to context switch; getting back into what they were doing takes time, and that time is wasted. Open floor plans are counter-productive and have established a culture of underperformance and overtime where developers regularly stay late or work late evenings from home to finish their tasks. Productivity falls and everyone suffers in the long run. Maybe it’s time to “Bring Back the Door” and provide a quiet, distraction-free environment for developers to exercise their creativity.</p>
<p><img src="/img/developers_music.png" alt="developer listening music to block out sound and noise" class="center-image" /></p>
Avoid Singletons to Write Testable Code2016-05-27T00:00:00+00:00https://codeahoy.com/2016/05/27/avoid-singletons-to-write-testable-code<p>Often times there is a need to share a single object of a class throughout the code base. For example, we might want to store all online users in one central registry or share a central queue amongst all producer and consumer objects. We want to:</p>
<ul>
<li>ensure that exactly <em>one</em> object of a class exists.</li>
<li>provide a way to get that object.</li>
</ul>
<!--more-->
<p>This is a <strong>valid and a very common requirement</strong> but is often equated with a design pattern called <a href="https://sourcemaking.com/design_patterns/singleton">Singleton</a>. While they provide a quick and easy solution, <strong>singletons are considered bad because they make unit testing and debugging difficult</strong>. <a href="https://www.linkedin.com/in/brianbutton">Brian Button</a> has made some valid <a href="https://blogs.msdn.microsoft.com/scottdensmore/2004/05/25/why-singletons-are-evil/">arguments</a> against singletons:</p>
<blockquote>
<p>[Singletons] provide a well-known point of access to some service in your application so that you don’t have to pass around a reference to that service. How is that different from a global variable? (remember, globals are bad, right???) What ends up happening is that the <strong>dependencies in your design are hidden inside the code, and not visible by examining the interfaces of your classes and methods</strong>. You have to inspect the code to understand exactly what other objects your class uses.</p>
</blockquote>
<blockquote>
<p>One of the underlying properties that makes code testable is that it is loosely coupled to its surroundings. This property allows you to substitute alternate implementations for collaborators during testing to <strong>achieve specific testing goals (think mock objects). Singletons tightly couple you to the exact type of the singleton object</strong>, removing the opportunity to use polymorphism to substitute an alternative.</p>
</blockquote>
<p>He’s absolutely right. It’s very difficult to write unit tests for code that uses singletons because such code is generally tightly coupled to the singleton instance, which makes it hard to control the creation of the singleton or to mock it.</p>
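<p>To see why, recall what a classic singleton looks like (a generic sketch, not any particular project’s code): the private constructor and static accessor are precisely what remove a test’s control over object creation.</p>

```java
// A classic eagerly-initialized singleton (sketch).
public class UserRegistrySingleton {
    private static final UserRegistrySingleton INSTANCE = new UserRegistrySingleton();

    private UserRegistrySingleton() {
        // Private: no test (or anything else) can construct a fresh instance.
    }

    public static UserRegistrySingleton getInstance() {
        return INSTANCE; // every caller in the process shares this one object
    }
}
```

<p>Any state put into that shared instance leaks between tests, and there is no seam where a mock could be substituted.</p>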
<p>Recently I came across a project that made very liberal use of singletons. When I asked the developer about it, he said: “<em>I know singletons are bad. But I needed a single instance of objects in all these cases and had no choice but to use the singleton.</em>” Wrong. Having a single instance is a valid requirement, but <strong>singletons are not the only solution</strong>. A cleaner alternative would be to create objects in one place like the <code class="language-plaintext highlighter-rouge">main()</code> method and pass them to other classes that need them using their constructors. <strong>This concept is called <a href="https://en.wikipedia.org/wiki/Dependency_injection">‘Dependency Injection’</a></strong>.</p>
<p>To illustrate these concepts, let’s look at a few examples. Suppose we want to store all online users in a central registry and provide a way to find out whether a user is online or not by username. If we use singletons, the code would look like:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="kd">class</span> <span class="nc">SomeClassUsingSingleton</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kt">boolean</span> <span class="nf">isUserOnline</span><span class="o">(</span><span class="nc">String</span> <span class="n">username</span><span class="o">)</span> <span class="o">{</span>
<span class="c1">// Obtain singleton instance directly</span>
<span class="nc">UserRegistry</span> <span class="n">userRegistry</span> <span class="o">=</span> <span class="nc">UserRegistry</span><span class="o">.</span><span class="na">getInstance</span><span class="o">();</span>
<span class="k">return</span> <span class="n">userRegistry</span><span class="o">.</span><span class="na">get</span><span class="o">(</span><span class="n">username</span><span class="o">)</span> <span class="o">!=</span> <span class="kc">null</span> <span class="o">?</span> <span class="kc">true</span> <span class="o">:</span> <span class="kc">false</span><span class="o">;</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>Let’s rewrite the above code to use dependency injection instead of singletons. In this example, <code class="language-plaintext highlighter-rouge">UserRegistry</code> object is created in the <code class="language-plaintext highlighter-rouge">main()</code> method and passed to the class via its constructor:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="kd">class</span> <span class="nc">SomeClassUsingDependencyInjection</span> <span class="o">{</span>
<span class="kd">private</span> <span class="kd">final</span> <span class="nc">UserRegistry</span> <span class="n">userRegistry</span><span class="o">;</span>
<span class="c1">// Creators of the class pass an instance of UserRegistry</span>
<span class="kd">public</span> <span class="nf">SomeClassUsingDependencyInjection</span><span class="o">(</span><span class="nc">UserRegistry</span> <span class="n">userRegistry</span><span class="o">)</span> <span class="o">{</span>
<span class="k">this</span><span class="o">.</span><span class="na">userRegistry</span> <span class="o">=</span> <span class="n">userRegistry</span><span class="o">;</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kt">boolean</span> <span class="nf">isUserOnline</span><span class="o">(</span><span class="nc">String</span> <span class="n">username</span><span class="o">)</span> <span class="o">{</span>
<span class="k">return</span> <span class="n">userRegistry</span><span class="o">.</span><span class="na">get</span><span class="o">(</span><span class="n">username</span><span class="o">)</span> <span class="o">!=</span> <span class="kc">null</span> <span class="o">?</span> <span class="kc">true</span> <span class="o">:</span> <span class="kc">false</span><span class="o">;</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
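<p>With the registry injected, a unit test can hand the class a fake registry it fully controls. The stub below is a hypothetical minimal stand-in, shaped like the classes above; only the constructor-injection pattern is the point:</p>

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal registry, shaped like the one in the article.
class UserRegistry {
    private final Map<String, Object> users = new HashMap<>();
    void add(String username) { users.put(username, new Object()); }
    Object get(String username) { return users.get(username); }
}

class OnlineChecker {
    private final UserRegistry userRegistry;
    OnlineChecker(UserRegistry userRegistry) { this.userRegistry = userRegistry; }
    boolean isUserOnline(String username) { return userRegistry.get(username) != null; }
}

public class OnlineCheckerTest {
    public static void main(String[] args) {
        // The test -- not a static getInstance() -- decides what the
        // registry contains. No global state survives between tests.
        UserRegistry fake = new UserRegistry();
        fake.add("alice");
        OnlineChecker checker = new OnlineChecker(fake);
        if (!checker.isUserOnline("alice")) throw new AssertionError("alice should be online");
        if (checker.isUserOnline("bob")) throw new AssertionError("bob should be offline");
        System.out.println("ok");
    }
}
```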
<p>Problem solved. However, initializing all objects in the <code class="language-plaintext highlighter-rouge">main()</code> method and passing them where they are needed is a mechanical task and increases the number of arguments declared in constructors. Dependency injection (DI) frameworks and containers like <a href="https://github.com/google/guice">Google’s Guice</a> or <a href="http://docs.spring.io/autorepo/docs/spring/3.2.x/spring-framework-reference/html/beans.html">Spring</a> are designed to <em>automate</em> this task. <strong>You just declare <em>dependencies</em> on objects where you need them and the framework automatically provides or <em>injects</em> them</strong>. It’s that simple. Let’s rewrite the above example using Spring:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="kd">class</span> <span class="nc">SomeClassUsingSpringDI</span> <span class="o">{</span>
<span class="nd">@Autowired</span> <span class="c1">// Will be injected automatically by Spring</span>
<span class="kd">private</span> <span class="nc">UserRegistry</span> <span class="n">userRegistry</span><span class="o">;</span>
<span class="kd">public</span> <span class="nf">SomeClassUsingSpringDI</span><span class="o">()</span> <span class="o">{</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kt">boolean</span> <span class="nf">isUserOnline</span><span class="o">(</span><span class="nc">String</span> <span class="n">username</span><span class="o">)</span> <span class="o">{</span>
<span class="k">return</span> <span class="n">userRegistry</span><span class="o">.</span><span class="na">get</span><span class="o">(</span><span class="n">username</span><span class="o">)</span> <span class="o">!=</span> <span class="kc">null</span> <span class="o">?</span> <span class="kc">true</span> <span class="o">:</span> <span class="kc">false</span><span class="o">;</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>And the <code class="language-plaintext highlighter-rouge">UserRegistry</code> class is declared as a Spring component:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@Component</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">UserRegistry</span> <span class="o">{</span>
<span class="o">...</span>
<span class="o">}</span>
</code></pre></div></div>
<p>We can also have multiple implementations of the <code class="language-plaintext highlighter-rouge">UserRegistry</code> such as one that uses an in-memory data structure to store users and another one that uses an external cache, and tell the DI framework to pick and inject the right one at run-time.</p>
<p><strong>To write testable code, we must separate object creation from the business logic</strong> and singletons, by their nature, prevent this by providing a global and a <code class="language-plaintext highlighter-rouge">static</code> way of creating and obtaining an instance of their classes. They make it impossible to mock and isolate methods that depend on them for testing. The <code class="language-plaintext highlighter-rouge">new</code> operator is no different and the same arguments apply. Dependency injection frameworks centralize object creation and separate it from the business logic, making it possible to isolate methods and write good unit tests. Projects of all sizes can benefit from DI frameworks. I normally include a DI framework from the very start because it’s tough to resist the singleton temptation half-way through the project when you really need it and don’t have time to mess around setting up DI.</p>
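<p>The testability argument can be sketched in plain Java, with no framework at all: if the dependency arrives through the constructor, a test can hand in a fake. (The <code class="language-plaintext highlighter-rouge">PresenceChecker</code> name and the one-method <code class="language-plaintext highlighter-rouge">UserRegistry</code> interface below are made up for illustration; they are not from the original example.)</p>

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative one-method interface; a lambda or method reference can implement it.
interface UserRegistry {
    Object get(String username);
}

class PresenceChecker {
    private final UserRegistry userRegistry;

    // The dependency is injected via the constructor, so tests can pass a fake.
    PresenceChecker(UserRegistry userRegistry) {
        this.userRegistry = userRegistry;
    }

    boolean isUserOnline(String username) {
        return userRegistry.get(username) != null;
    }
}

public class PresenceCheckerTest {
    public static void main(String[] args) {
        Map<String, Object> online = new HashMap<>();
        online.put("alice", new Object());
        // A hand-rolled fake: no network, no singleton, no framework needed.
        PresenceChecker checker = new PresenceChecker(online::get);
        System.out.println(checker.isUserOnline("alice")); // true
        System.out.println(checker.isUserOnline("bob"));   // false
    }
}
```

<p>Had <code class="language-plaintext highlighter-rouge">PresenceChecker</code> grabbed a singleton internally, the test above would be impossible to isolate.</p>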
Effective Coding Standards2016-05-22T00:00:00+00:00https://codeahoy.com/2016/05/22/effective-coding-standards<p>Coding standards are a set of guidelines, best practices, programming styles and conventions that developers adhere to when writing source code for a project. All big software companies have them. Here are few guidelines from the <a href="https://www.kernel.org/doc/Documentation/CodingStyle">‘Linux kernel coding style’</a>:</p>
<blockquote>
<p><strong>a.</strong> Tabs are 8 characters, and thus indentations are also 8 characters.</p>
<p><strong>b.</strong> The limit on the length of lines is 80 columns and this is a strongly
preferred limit.</p>
<p><strong>c.</strong> The preferred form for allocating a zeroed array is the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>p = kcalloc(n, sizeof(...), ...);
</code></pre></div> </div>
<p>This form checks for overflow on the allocation size n * sizeof(…),
and returns NULL if that occurred.</p>
</blockquote>
<!--more-->
<p>Recently, I came across a blog post from <a href="https://twitter.com/rjrodger">Richard Rodger</a>. In ‘<a href="http://www.richardrodger.com/2012/11/03/why-i-have-given-up-on-coding-standards/#.V0FHiJMrJsM">Why I Have Given Up on Coding Standards</a>’, he writes:</p>
<blockquote>
<p>Every developer knows you should have a one, exact, coding standard in your company. Every developer also knows you have to fight to get your rules into the company standard. Every developer secretly despairs when starting a new job, <strong>afraid of the crazy coding standard some power-mad architect has dictated</strong>.</p>
<p><strong>It’s better to throw coding standards out and allow free expression</strong>. The small win you get from increased conformity does not move the needle. Coding standards are technical ass-covering.</p>
</blockquote>
<p><em>Oh boy</em>. While I <strong>disagree</strong> with Richard that coding standards should be abandoned, I share his pain. I briefly worked with a nut job of a “senior developer” who came in as the lead for a project we’d been working on for 6 months. He was an academic who had just finished his PhD and had little experience working on real-world projects. He spent the first couple of weeks writing “coding standards” in <em>total isolation</em> like he was some kind of a God and we were lowly beings who just weren’t good enough. His coding standards document was full of his personal opinions and promoted some insane form of coding style. The control freak <em>demanded</em> that we update the source code we had already written to reflect his standards. I have never witnessed team morale hit rock bottom so fast. Needless to say, I had a very brief stay at that job.</p>
<p>Another example that I can think of: a manager who insisted on being part of every major code review. During the reviews, he would flag formatting issues, that were almost always a matter of his own preference, as “errors”. The worst part of the story is that <em>he hadn’t written down his coding standards anywhere!</em> I guess he thought developers would learn his style through osmosis. Sometimes, it felt as if he made up rules on the fly. As I mentioned in my post on conducting <a href="http://codeahoy.com/2016/04/03/effective-code-reviews/">effective code reviews</a>, it is useless arguing over formatting issues if you don’t have coding standards.</p>
<p>The point is that coding standards are often misunderstood by naive managers and control freaks who misuse them in ways that achieve nothing (<em>no one follows them</em>) or cause friction within the team and hurt morale. Many software developers become bitter and start hating coding standards. Coding standards aren’t the problem. Like any other tool, they become harmful when used incorrectly or in the wrong hands. Coding standards that suck have the following attributes:</p>
<ul>
<li>full of the author’s own opinions and personal coding style. Coding standards are not personal agendas.</li>
<li>a heavy focus on style and formatting issues, often vaguely worded.</li>
<li>recommendations disguised as standards. I have made this <a href="https://github.com/starscriber/coding-standards/wiki">mistake</a>. <strong>Standards must be treated like rules and hence must be enforceable</strong>.</li>
</ul>
<p>Good software developers and architects understand that coding style is very personal and varies from individual to individual. They write coding standards that respect developers’ freedom and allow them to express themselves. They do not attempt to mechanize the whole process; rather, they focus on a few well-known practices that are widely accepted or plain common sense. And before any standard is put into practice, they get buy-in from the team, if the team wasn’t already involved in formulating the standards. Here are <strong>a few examples of good coding standards</strong> related to formatting:</p>
<ul>
<li>No more than one statement per line.</li>
<li>Line length should not exceed 80 or 100 characters.</li>
<li>Test class must start with the name of the class they are testing followed by ‘Test’. E.g. <code class="language-plaintext highlighter-rouge">ServerConfigurationTest</code>.</li>
<li>One character variable names should only be used in loops or for temporary variables.</li>
</ul>
<p>All of these could be easily justified in a black-and-white manner without the enforcer appearing like a dictator. On the other extreme, here are some so-called “standards” that will rub developers the wrong way and prompt unnecessary debates:</p>
<ul>
<li>Class names must not end in <code class="language-plaintext highlighter-rouge">-er</code>. <em>[Personal Opinion]</em></li>
<li>Don’t use <code class="language-plaintext highlighter-rouge">static</code> fields or methods. <em>[Personal Opinion]</em></li>
<li>Aim for high cohesion and low coupling. <em>[Recommendation. Cannot be enforced.]</em></li>
<li>Use Test Driven Development. <em>[Recommendation. Cannot be enforced.]</em></li>
</ul>
<p>There is a grey area between common sense guidelines and personal preferences such as whether to put braces on the same line or the next. Standardize if you must, but try to keep items in the grey area (generally formatting issues) to a bare minimum.</p>
<h2 id="effective-coding-standards">Effective Coding Standards</h2>
<p>Let’s ask the question: why exactly do we need coding standards and what benefits do they offer? <a href="https://msdn.microsoft.com/en-us/library/aa291591(v=vs.71).aspx">Most</a> <a href="https://www.sitepoint.com/coding-standards/">articles</a> <a href="https://www.smashingmagazine.com/2012/10/why-coding-style-matters/">I</a> found online draw a direct relationship between coding standards and software maintainability. While there is absolutely no doubt that source code adhering to good standards is more readable and reflects harmony, there is another side of coding standards that is often overlooked because of too much attention on aesthetics. <strong>Effective coding standards focus on techniques that highlight problems and make bugs stand out, visible to everyone</strong>. <a href="http://www.joelonsoftware.com/articles/Wrong.html">Joel said it better in 2005</a>:</p>
<blockquote>
<p>Look for <strong>coding conventions that make wrong code look wrong</strong>.</p>
</blockquote>
<p>In Java programming, having the following standards will help catch bugs early on and increase software quality:</p>
<ul>
<li>Whenever you override the <code class="language-plaintext highlighter-rouge">equals()</code> method, you must also override the <code class="language-plaintext highlighter-rouge">hashCode()</code> method.</li>
<li>Do not compare strings using <code class="language-plaintext highlighter-rouge">==</code> or <code class="language-plaintext highlighter-rouge">!=</code>.</li>
<li>Do not ignore exceptions that you caught.</li>
<li>Do not catch broad exception classes like <code class="language-plaintext highlighter-rouge">Exception</code> or <code class="language-plaintext highlighter-rouge">RuntimeException</code>.</li>
</ul>
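<p>These rules pay off because the broken code genuinely looks harmless. Here is a small, self-contained sketch (the <code class="language-plaintext highlighter-rouge">Point</code> class and the values are made up for illustration):</p>

```java
import java.util.HashSet;
import java.util.Set;

// Overrides equals() but not hashCode() -- exactly the bug the first rule targets.
class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
    @Override public boolean equals(Object o) {
        return o instanceof Point && ((Point) o).x == x && ((Point) o).y == y;
    }
}

public class WrongCodeDemo {
    public static void main(String[] args) {
        // Rule: don't compare strings with == (it compares references, not contents).
        String a = "code";
        String b = new String("code"); // a distinct object with the same contents
        System.out.println(a == b);      // false
        System.out.println(a.equals(b)); // true

        // Rule: override hashCode() along with equals(). Without it, hash-based
        // collections silently lose track of "equal" objects.
        Set<Point> set = new HashSet<>();
        set.add(new Point(1, 2));
        System.out.println(set.contains(new Point(1, 2))); // almost always false
    }
}
```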
<p>To recap, effective coding standards:</p>
<ol>
<li>are short, simple and concise rules. They <em>do not attempt to cover and processify everything</em> and leave plenty of room for developers to exercise their own creativity.</li>
<li>strike the right balance between formatting issues and issues that “<em>make the wrong code look wrong</em>”.</li>
<li>are black and white instead of vague suggestions or recommendations.</li>
</ol>
<p>Coding standards, when used for the right reasons and in the right manner, offer many benefits. They make source code more readable and the software project more maintainable. They also help catch bugs and mistakes that are disguised as seemingly harmless code. While they might not catch all possible bugs, I’ll take something over nothing any day.</p>
<h2 id="automate-the-process-of-checking-code-standards">Automate the Process of Checking Code Standards</h2>
<p>Once you have effective coding standards, you should <strong>automate the process</strong> of checking source code’s adherence to standards. It will save a lot of time in peer reviews and catch everything that humans might miss. There is an abundance of style checking tools available for all major <a href="https://en.wikipedia.org/wiki/List_of_tools_for_static_code_analysis">programming languages</a>. For our Java projects, we use a popular tool called <a href="http://checkstyle.sourceforge.net/">Checkstyle</a> which checks source code for style and design problems. It provides several <em><a href="http://checkstyle.sourceforge.net/checks.html">Checks</a></em> and by default, checks code against <a href="http://www.oracle.com/technetwork/java/codeconvtoc-136057.html">Sun’s conventions</a>. Or you could choose <a href="https://google.github.io/styleguide/javaguide.html">Google’s coding standards</a>, which I recommend. If you use IntelliJ IDEA (also highly recommended), there’s a <a href="https://plugins.jetbrains.com/plugin/1065">checkstyle plugin</a> that shows results right inside the IDE.</p>
<p><img src="https://codeahoy.com/img/checkstyle-intellij.png" alt="IntelliJ Checkstyle" /></p>
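<p>As a sketch of what such automation looks like, here is a minimal Checkstyle configuration enforcing the four Java bug-catching rules listed earlier. (Module names are per recent Checkstyle releases; verify them against the version you use.)</p>

```xml
<?xml version="1.0"?>
<!DOCTYPE module PUBLIC
    "-//Checkstyle//DTD Checkstyle Configuration 1.3//EN"
    "https://checkstyle.org/dtds/configuration_1_3.dtd">
<module name="Checker">
  <module name="TreeWalker">
    <!-- Flag classes that override equals() without hashCode() -->
    <module name="EqualsHashCode"/>
    <!-- Flag == / != comparisons against string literals -->
    <module name="StringLiteralEquality"/>
    <!-- Flag empty catch blocks that swallow exceptions -->
    <module name="EmptyCatchBlock"/>
    <!-- Flag catches of overly broad exception types (Exception, RuntimeException, ...) -->
    <module name="IllegalCatch"/>
  </module>
</module>
```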
<p>If you are starting fresh and looking for coding standards, start by searching online for well-known standards for your programming language. Start small and remember that there is more than <em>one right way</em> to style the code. There might already be a standard available that you could borrow. For Java, I personally like <a href="http://google.github.io/styleguide/javaguide.html">Google’s coding standards</a>, which I adopted with slight modifications to the indentation rules. Google also has coding standards for many <a href="https://github.com/google/styleguide">other languages</a>. <strong>Your coding standards should check for both style issues and design problems</strong>. Once you have standards, make sure that they are <strong>adopted by the team and automated</strong>. However, code that deviates from standards shouldn’t be considered erroneous. It should simply be marked as out of compliance, and deviations must be reviewed and fixed.</p>
<h2 id="suggested-reading">Suggested Reading</h2>
<ul>
<li><a href="https://en.wikipedia.org/wiki/List_of_tools_for_static_code_analysis">List of tools for static code analysis</a></li>
<li><a href="http://paul-m-jones.com/archives/34">Why Coding Standards Matter (blog)</a></li>
<li><a href="http://findbugs.sourceforge.net/">FindBugs (Java)</a></li>
<li><a href="http://pmd.github.io/">PMD</a></li>
</ul>
7 Deadly Sins of Mobile Websites2016-05-15T00:00:00+00:00https://codeahoy.com/2016/05/15/7-deadly-sins-of-mobile-websites<p><em>A mobile website is designed and optimized for browsing on a smartphone</em>. They come in various flavours: sometimes there are different desktop and mobile versions, other times they are the same. The focus of this article is on those websites that are specifically designed for “<em>enhancing the mobile user experience</em>”, although more often than not, it has the exact opposite effect.</p>
<p><img src="https://codeahoy.com/img/7deadly/1.png" alt="BBC mobile Site" />
<em>A mobile site</em></p>
<!--more-->
<p>The sins presented in this article are so common that I’m sure readers must be able to relate to them. The most common complaints people have - or the 7 Deadly Sins of Mobile Websites - are <em>drumroll</em>:</p>
<ol>
<li>Slow to load</li>
<li>Cluttered with text (Happy talk, Me Talk)</li>
<li>Crappy navigation (Small buttons)</li>
<li>Different content/theme from the Desktop version? (<em>“Honey, where did the categories go?”</em>)</li>
<li>Popups</li>
<li>Auto Redirect</li>
<li>Advertisements and Banners</li>
</ol>
<h2 id="the-1st-deadly-sin-slow-to-load">The 1st deadly sin: Slow to Load</h2>
<p>It’s a <a href="http://blog.kissmetrics.com/loading-time/">well-known phenomenon</a> that website loading time is a crucial factor in determining the site’s success. Websites that are slow by virtue of loading high-resolution images (even worse, a whole carousel full of them) and useless piles of JavaScript crap will <a href="http://gigaom.com/2011/07/19/consumers-losing-patience-with-the-slow-mobile-web/">lead to abandonment</a>.</p>
<p><img src="https://codeahoy.com/img/7deadly/2.png" alt="Wait... wait... wait for it..." />
<em>Wait… wait… wait for it…</em></p>
<p>I have decided not to sound like a broken record. Instead I will leave you with this rather fine point to ponder: A percentage of users browsing mobile websites are using slower connections like EDGE or 3G, not the fast Wifi we have in our homes and offices. So while a website may take 4 seconds to load on desktop, it might take longer than that to load on the mobile.</p>
<h2 id="the-2nd-deadly-sin-cluttered-with-text">The 2nd deadly sin: Cluttered with Text</h2>
<p>Websites guilty of this vice cram in too much unnecessary text to look busy, put up <a href="http://en.wikipedia.org/wiki/Happy_talk">happy talk</a> (<em>“Welcome to my sweet, lovely website”</em>) or what I like to call “me” talk (<em>“We are very happy to receive the elusive Greatness-Trophy thanks to the hard work and dedication of our CEO”</em>). The average user who arrived on Acme Widget’s website to check hours of operation doesn’t give a fat f*ck about their head of marketing’s epiphanies.</p>
<p><img src="https://codeahoy.com/img/7deadly/3.png" alt="Time to bring out the Magnifier" />
<em>Time to bring out the Magnifier</em></p>
<p>In real life, we like people who are articulate. They speak fluently, coherently and get their thoughts across in a concise manner. Why should websites be any different?</p>
<p><img src="https://codeahoy.com/img/7deadly/4.jpg" alt="This looks like dog's hairy behind!" />
<em>This looks like dog’s hairy behind!</em></p>
<h2 id="the-3rd-deadly-sin-crappy-navigation">The 3rd deadly sin: Crappy Navigation</h2>
<p>Hidden menus, hover objects, small buttons, small radio buttons, buttons that are too close to each other, the list goes on. Imagine you’re travelling in a municipal bus that is jerking you around like no roller coaster would, while you are trying to <em>upvote</em> the new kitten video. But you can’t land your finger on the damn button. Every time you try, it opens the link beside that button - every link to its left or right seems to work no matter how hard you try. And who do you blame for your misery: your. fat. fingers.</p>
<p><img src="https://codeahoy.com/img/7deadly/5.jpg" alt="Will I get it to land on the right link this time?" />
<em>Will I get it to land on the right link this time?</em></p>
<h2 id="the-4th-deadly-sin-different-contenttheme-from-the-desktop-version">The 4th deadly sin: Different Content/Theme from the Desktop Version</h2>
<p>Oftentimes, the mobile website looks so different from the desktop version that I start wondering if I’m on the right website. If the mobile website has a very different layout and navigation, it’s asking <strong>users to go through another learning curve</strong>. No thanks.</p>
<p><img src="https://codeahoy.com/img/7deadly/6.jpg" alt="Desktop vs Mobile version. Is this the same site? Where did the content go?" />
<em>Desktop vs Mobile version. Is this the same site? Where did the content go?</em></p>
<h2 id="the-5th-deadly-sin-popups">The 5th deadly sin: Popups</h2>
<p>Ah, the joy of having a huge box in front of my face telling me how good the App is and that I should get it from the App Store. The irony: I already have the steaming pile of sh*t app installed and would much rather use the website. How about I just leave instead?</p>
<p><img src="https://codeahoy.com/img/7deadly/7.jpg" alt="The content is behind there." />
<em>The content is behind there. Somewhere.</em></p>
<p><a href="http://econsultancy.com/ca/blog/62084-10-mobile-websites-that-suffer-from-chronic-pop-up-syndrome">Chris Lake makes some compelling points</a> against the Popup Syndrome, summarized in this fantastic statement:</p>
<blockquote>
<p>A link should be a promise: you click one to be taken to a specific page. That’s just how it is, and it’s what every web user expects (unless programmed to expect something different, e.g. Forbes, which I no longer visit). Websites that lead you down the garden path before fulfilling the promise only serve to disappoint users.</p>
</blockquote>
<p>Need I say more? Popups are so bad on mobile websites they should be banned altogether and erased from existence.</p>
<h2 id="the-6th-deadly-sin-auto-redirect">The 6th deadly sin: Auto Redirect</h2>
<p>At first, it sounds like a novel idea: redirect users to mobile optimized or iPad optimized website based on the type of device/browser they are using. If done right and you have a great mobile website, it works great. The problem is when the mobile website sucks, <strong>users who just want to use the desktop version instead will get agitated if they don’t get what they want</strong>. I become very annoyed whenever sites ask me to make the <em>very important decision</em> to choose between <em>“The Mobile Version”</em>, <em>“The Desktop Version”</em>, or <em>“Get the App”</em> every single time.</p>
<p>But what makes <strong>auto redirect an absolute sin is faulty redirects</strong>. As Yoshikiyo Kato writes in this <a href="http://blog.nxcgroup.com/2013/google-to-start-penalizing-mobile-redirects/">post</a>:</p>
<blockquote>
<p>A faulty redirect is when a desktop page redirects smartphone users to an irrelevant page on the smartphone-optimized website. A typical example is when all pages on the desktop site redirect smartphone users to the homepage of the smartphone-optimized site.</p>
</blockquote>
<p>For example, in the figure below, the redirects shown as red arrows are considered faulty.</p>
<p><img src="https://codeahoy.com/img/7deadly/8.png" alt="Faulty redirects" /></p>
<p>In fact, faulty redirection is so bad that there are reports suggesting that <a href="http://blog.nxcgroup.com/2013/google-to-start-penalizing-mobile-redirects/">Google penalizes websites</a> guilty of this sin in terms of their page rank.</p>
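<p>One way to avoid faulty redirects is to compute the equivalent mobile URL for the specific page requested, and fall back to serving the desktop page (not the mobile homepage) when no equivalent exists. A minimal Java sketch of the idea; the <code class="language-plaintext highlighter-rouge">m.example.com</code> host and the page list are made up for illustration:</p>

```java
import java.util.Set;

public class MobileRedirect {
    // Pages that have a mobile-optimized equivalent (illustrative).
    private static final Set<String> MOBILE_PAGES = Set.of("/", "/news", "/contact");

    /**
     * Returns the equivalent mobile URL for a desktop path, or null when there
     * is no equivalent. On null, serve the desktop page rather than redirecting
     * the user to the mobile homepage (the classic "faulty redirect").
     */
    static String mobileUrlFor(String desktopPath) {
        return MOBILE_PAGES.contains(desktopPath)
                ? "https://m.example.com" + desktopPath
                : null;
    }

    public static void main(String[] args) {
        System.out.println(mobileUrlFor("/news"));       // https://m.example.com/news
        System.out.println(mobileUrlFor("/archive/q2")); // null -> no redirect
    }
}
```

<p>The important design choice is the <code class="language-plaintext highlighter-rouge">null</code> branch: an unknown page should never collapse to the mobile homepage.</p>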
<h2 id="the-7th-deadly-sin-advertisement--banner-ads">The 7th deadly sin: Advertisement & Banner Ads</h2>
<p>How could anyone even think about jamming a bunch of advertisements into that limited screen real estate? Some people think putting banners on websites is a wise idea. News flash - banner ads are trouble. This is 2016, not 1999.</p>
<p><a href="http://www.digitaltrends.com/web/banner-ads-suck-say-guys-who-invented-banner-ads/">Banner ads suck</a> say guys who invented banner ads. Nuff said.</p>
<p><img src="https://codeahoy.com/img/7deadly/9.jpg" alt="Last time we had little success with one banner. This time, let's double up the banners." />
<em>Last time we had little success with one banner. This time, let’s double up the banners.</em></p>
<p><img src="https://codeahoy.com/img/7deadly/9.jpg" alt="Overlay ad gone wrong" />
<em>Double check ads to make sure they don’t appear like these on smartphones.</em></p>
<p>Well there you have it, <em>The 7 deadly sins of mobile websites</em>. I hope you found this list useful. If you are guilty of committing these sins, don’t feel bad. As <a href="https://en.wikipedia.org/wiki/Robert_Zoellick">Robert Zoellick</a> said:</p>
<blockquote>
<p>All of us make mistakes. The key is to acknowledge them, learn, and move on. The real sin is ignoring mistakes, or worse, seeking to hide them.</p>
</blockquote>
<h3 id="if-i-missed-anything-or-a-sin-let-me-know-in-the-comments-section">If I missed anything or a sin, let me know in the comments section.</h3>
Software Estimates are not Targets2016-05-14T00:00:00+00:00https://codeahoy.com/2016/05/14/software-estimates-are-not-targets<p>Software estimation is a hard problem. So much so that in 2012, when <a href="https://twitter.com/WoodyZuill">Woody Zuill</a> tweeted his <a href="http://zuill.us/WoodyZuill/2012/12/10/no-estimate-programming-series-intro-post/">blog post</a> with the hashtag <a href="https://twitter.com/hashtag/noestimates?src=rela">#NoEstimates</a>, he set the software development community on fire. Everyone from discontented developers to seasoned software veterans flocked to the discussion on Twitter on either side of the debate. (You can read more about the #NoEstimates movement <a href="http://ronjeffries.com/xprog/articles/the-noestimates-movement/">here</a>.) In this post, let’s try to answer the question whether we need estimates and then look at what software estimates are and more importantly what they aren’t.</p>
<p><a href="https://en.wikipedia.org/wiki/Tom_DeMarco">Tom DeMarco</a> gave the following tongue-in-cheek definition of a software estimate:</p>
<blockquote>
<p>“An estimate is the most optimistic prediction that has a non-zero probability of coming true.”</p>
</blockquote>
<p>Which brings me to my next point.</p>
<p><!--more--></p>
<h2 id="do-we-need-estimates">Do We Need Estimates?</h2>
<p>The short answer is <em>yes, we do</em>.</p>
<p>We need estimates for the following reasons:</p>
<ul>
<li>To know when something will be done.</li>
<li>To provide a release date.</li>
<li>To allocate cost, resources and people.</li>
<li>To make go/no-go calls: pursue the development or kill the proposal.</li>
</ul>
<p><strong>We need estimates to make decisions</strong> and <strong>estimates form the foundation of plans</strong>. Good estimates allow us to make sound decisions and to chart a course of action.</p>
<h2 id="what-is-an-estimate">What is an Estimate?</h2>
<p>Google gives the following <a href="https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=define+estimate">definition</a> of an estimate:</p>
<blockquote>
<ol>
<li>roughly calculate or judge the value, number, quantity, or extent of.</li>
<li>an approximate calculation or judgment of the value, number, quantity, or extent of something.</li>
</ol>
</blockquote>
<p>Strictly speaking, the dictionary’s definition is correct but the term has very different connotations in the software world. Estimates are frequently <strong>confused with organizational targets and plans. An organizational target is a desirable outcome that benefits the organization in one way or the other. It is an objective that the organization seeks</strong>. When you hear statements that sound similar to the following, immediately know that the speaker is talking about the organization’s targets, not estimates:</p>
<ul>
<li><em>“The goal is to have the new feature released to all users by May 19th.”</em></li>
<li><em>“We must have the product ready by the end of this month for our show and tell.”</em></li>
<li><em>“We need this feature released in time for Christmas.”</em></li>
<li><em>“We must get the module updated ASAP to align with the new regulation.”</em></li>
</ul>
<p>While estimates are related to targets, the two are entirely different beasts. <strong>A target is a goal which requires a sound plan to achieve it. In turn, plans require estimates. But estimation doesn’t need to look at a target or a plan. Estimates should be completely unbiased</strong>. If it requires 6 weeks to update a module to use DynamoDB instead of Cassandra, it will take 6 weeks irrespective of the target, which could be 4 weeks or 6 months. Good plans incorporate estimates to figure out how to meet the target. If the target is to have the DynamoDB upgrade complete in 4.5 weeks instead of 6 weeks, the plan might include additional resources to speed up the process.</p>
<blockquote>
<p>An estimate is NOT a target. <strong>Estimates</strong> are unbiased; <strong>Plans</strong> are heavily biased and rely on estimates to figure out how to meet the <strong>target</strong>.</p>
</blockquote>
<p>I wish I had known the difference early on in my career. Many times I got requests along the following lines:</p>
<blockquote>
<p>“<em>Umer</em>, we need this product ready in 2 months or we’ll lose the bid to the other vendor.” - Account Manager</p>
</blockquote>
<p>I used to get all worked up at what I felt were unrealistic deadlines and estimates. In hindsight, these non-technical managers were just sharing business commitments and targets. And there’s nothing wrong with it. <em>(In some cases though, I’d have appreciated it if they had consulted the team before making any commitments, but that’s another story.)</em> Without knowing the difference between estimates and plans, I committed to the imposed deadlines. Even if the product was done on time, there were negative side-effects. The team was overworked, the work was of low quality and <a href="http://codeahoy.com/2016/04/27/do-not-let-technical-debt-get-out-of-control/">technical debt was accrued</a>. I accepted targets as estimates and as a substitute for a plan. In his book <a href="http://www.amazon.com/Software-Estimation-Demystifying-Developer-Practices/dp/0735605351">Software Estimation: Demystifying the Black Art</a>, which is a definitive work on the subject, Steve McConnell provides a more productive way to deal with such requests:</p>
<blockquote>
<p>EXECUTIVE: How long do you think this project will take? We need to have this
software ready in 3 months for a trade show. I can’t give you any more team
members, so you’ll have to do the work with your current staff. Here’s a list of
the features we’ll need.</p>
<p>PROJECT LEAD: Let me make sure I understand what you’re asking for. Is it more
important for us to deliver 100% of these features, or is it more important to
have something ready for the trade show?</p>
<p>EXECUTIVE: We have to have something ready for the trade show. We’d like to
have 100% of those features if possible.</p>
<p>PROJECT LEAD: I want to be sure I follow through on your priorities as best I can.
If it turns out that we can’t deliver 100% of the features by the trade show,
should we be ready to ship what we’ve got at trade show time, or should we plan
to slip the ship date beyond the trade show?</p>
<p>EXECUTIVE: We have to have something for the trade show, so if push comes to
shove, we have to ship something, even if it isn’t 100% of what we want.</p>
<p>PROJECT LEAD: <em>OK, I’ll come up with a plan for delivering as many features as
we can in the next 3 months</em>.</p>
</blockquote>
<p>(In this example, the actual estimate to finish the project was 5 months.)</p>
<p>Steve gives the following definition of a “Good Estimate” in his book:</p>
<blockquote>
<p>A good estimate is an estimate that provides a clear enough view of the project reality to allow the project leadership to make good decisions about how to control the project to hit its targets.</p>
</blockquote>
<p>100% agreed. The key takeaways from this post are:</p>
<ul>
<li>Estimates are a crucial part of any plan, and a bad plan is better than no plan at all.</li>
<li>Estimates ≠ Targets.</li>
<li>Estimates are unbiased.</li>
<li>Plans are heavily biased and incorporate estimates to meet targets.</li>
</ul>
<p>Software estimation is tough. There are many unknowns and uncertainties. It’s like driving on the <a href="https://en.wikipedia.org/wiki/San_Francisco%E2%80%93Oakland_Bay_Bridge">Bay Bridge</a> to get to San Francisco in dense fog and broken headlights, only to receive a phone call from the hosts that the party has moved to the opposite side. Software estimation is tough, but it’s certainly not magic. This post touched upon the definition of software estimates and the relationship between estimates, plans and targets. Later posts will talk about how to get better at estimating. In the meantime, go buy a copy of <a href="http://www.amazon.com/Software-Estimation-Demystifying-Developer-Practices/dp/0735605351">Steve’s book</a>. You will not regret it.</p>
Top new Java features in Java 8 and Beyond2016-05-10T00:00:00+00:00https://codeahoy.com/2016/05/10/5-10-java-features-many-developers-havent-heard-of<p>In this post, we’ll look at some of the most useful features introduced in Java 7 and Java 8. Let’s go.</p>
<!--more-->
<p>Here’s a list of the features covered in this post so you can skip to the ones you don’t know about, or skip this post entirely if you know them all :-)</p>
<ol>
<li><a href="#1">The try-with-resources Statement</a></li>
<li><a href="#2">Catch Multiple Exceptions in a Single <code class="language-plaintext highlighter-rouge">catch</code> Block</a></li>
<li><a href="#3">Underscores in Numeric Literals</a></li>
<li><a href="#4">Default Methods (in Interface)</a></li>
<li><a href="#5">Parallel Sorting of Large Arrays</a></li>
<li><a href="#6">Optional Return Values</a></li>
<li><a href="#7">Strings in <code class="language-plaintext highlighter-rouge">switch</code> statements</a></li>
<li><a href="#8">The Diamond Operator <code class="language-plaintext highlighter-rouge"><></code></a></li>
<li><a href="#9">Annotations Everywhere</a></li>
<li><a href="#10">Varargs</a></li>
</ol>
<p>A word of caution: there’s no harm in doing things the old-fashioned way if it is working. That is much better than rushing to use a feature that is not fully understood. But knowledge is power.</p>
<p>Without further ado, let’s get started.</p>
<h2 id="-1-the-try-with-resources-statement"><a name="1"></a> 1. The try-with-resources Statement</h2>
<p>Prior to Java 7, working with <a href="https://docs.oracle.com/javase/7/docs/api/java/io/InputStream.html"><code class="language-plaintext highlighter-rouge">InputStream</code></a> (and other similar APIs) produced ugly looking code. Here’s an egregious example.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">InputStream</span> <span class="n">is</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">FileInputStream</span><span class="o">(</span><span class="s">"unreadme.txt"</span><span class="o">);</span>
<span class="k">try</span> <span class="o">{</span>
<span class="c1">// Read the file</span>
<span class="o">}</span> <span class="k">catch</span><span class="o">(</span><span class="nc">IOException</span> <span class="n">e</span><span class="o">)</span> <span class="o">{</span>
<span class="c1">// Handle if an exception occurs while reading the file</span>
<span class="o">}</span> <span class="k">finally</span> <span class="o">{</span>
<span class="k">try</span> <span class="o">{</span>
<span class="k">if</span><span class="o">(</span><span class="n">stream</span> <span class="o">!=</span> <span class="kc">null</span><span class="o">)</span> <span class="o">{</span>
<span class="n">stream</span><span class="o">.</span><span class="na">close</span><span class="o">();</span>
<span class="o">}</span>
<span class="o">}</span> <span class="k">catch</span><span class="o">(</span><span class="nc">IOException</span> <span class="n">e</span><span class="o">)</span> <span class="o">{</span>
<span class="c1">// Handle if an exception occurs while closing the file</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>There’s a lot of noise in the code above and twin <code class="language-plaintext highlighter-rouge">try-catch</code> statements. It is unnecessarily verbose, even by Java standards.</p>
<p>The <em>try-with-resources</em> statement addresses this issue. You declare any object that implements <a href="https://docs.oracle.com/javase/8/docs/api/java/lang/AutoCloseable.html"><code class="language-plaintext highlighter-rouge">java.lang.AutoCloseable</code></a> (which <code class="language-plaintext highlighter-rouge">Closeable</code> extends) at the start of the block. When the block exits, Java automatically closes the object by calling its <code class="language-plaintext highlighter-rouge">close()</code> method. Let’s rewrite the above example using <em>try-with-resources</em>:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">try</span><span class="o">(</span><span class="nc">FileInputStream</span> <span class="n">is</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">FileInputStream</span><span class="o">(</span><span class="s">"unreadme.txt"</span><span class="o">))</span> <span class="o">{</span>
<span class="c1">// do something with the file</span>
<span class="o">}</span> <span class="k">catch</span><span class="o">(</span><span class="nc">IOException</span> <span class="n">e</span><span class="o">)</span> <span class="o">{</span>
<span class="c1">//...</span>
<span class="o">}</span>
</code></pre></div></div>
<p>Much cleaner. You can even specify multiple <code class="language-plaintext highlighter-rouge">Closeable</code> objects.</p>
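<p>Here is a minimal sketch of that multi-resource form (the file names and <code class="language-plaintext highlighter-rouge">CopyFile</code> class are made up for illustration). Both resources are declared in one statement and closed automatically:</p>

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class CopyFile {
    // Both streams are declared in one try-with-resources statement; Java
    // closes them in reverse order of declaration (writer first, then reader).
    public static void copy(String src, String dst) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(src));
             BufferedWriter writer = new BufferedWriter(new FileWriter(dst))) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.newLine();
            }
        }
    }
}
```

<p>Note that there is no <code class="language-plaintext highlighter-rouge">finally</code> block at all; the cleanup code the first example needed is generated by the compiler.</p>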
<h2 id="-2-catch-multiple-exceptions-in-a-single-catch-block"><a name="2"></a> 2. Catch Multiple Exceptions in a Single <code class="language-plaintext highlighter-rouge">catch</code> Block</h2>
<p>Since Java 7, multiple exceptions can be caught in a single <code class="language-plaintext highlighter-rouge">catch</code> block which makes the code less verbose and avoids duplication. Here’s an example of the multi-catch block:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">try</span> <span class="o">{</span>
<span class="c1">// Code that throws multiple exceptions</span>
<span class="o">}</span> <span class="k">catch</span> <span class="o">(</span><span class="nc">IndexOutOfBoundsException</span> <span class="o">|</span> <span class="nc">IOException</span> <span class="n">ex</span><span class="o">)</span> <span class="o">{</span>
<span class="n">logger</span><span class="o">.</span><span class="na">error</span><span class="o">(</span><span class="s">"err"</span><span class="o">,</span> <span class="n">ex</span><span class="o">);</span>
<span class="o">}</span>
</code></pre></div></div>
<p>Instead of the usual:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">try</span> <span class="o">{</span>
<span class="c1">// Code that throws multiple exceptions</span>
<span class="o">}</span> <span class="k">catch</span> <span class="o">(</span><span class="nc">IndexOutOfBoundsException</span> <span class="n">oobe</span><span class="o">)</span> <span class="o">{</span>
<span class="n">logger</span><span class="o">.</span><span class="na">error</span><span class="o">(</span><span class="s">"trouble traversing"</span><span class="o">,</span> <span class="n">oobe</span><span class="o">);</span>
<span class="o">}</span> <span class="k">catch</span> <span class="o">(</span><span class="nc">IOException</span> <span class="n">ioe</span><span class="o">)</span> <span class="o">{</span>
<span class="n">logger</span><span class="o">.</span><span class="na">error</span><span class="o">(</span><span class="s">"problem reading file"</span><span class="o">,</span> <span class="n">ioe</span><span class="o">);</span>
<span class="o">}</span>
</code></pre></div></div>
<p>Also, don’t do this:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">try</span> <span class="o">{</span>
<span class="c1">// Code that throws multiple exceptions</span>
<span class="o">}</span> <span class="k">catch</span> <span class="o">(</span><span class="nc">Exception</span> <span class="n">e</span><span class="o">)</span> <span class="o">{</span>
<span class="n">logger</span><span class="o">.</span><span class="na">error</span><span class="o">(</span><span class="s">"an error occurred"</span><span class="o">,</span> <span class="n">e</span><span class="o">);</span>
<span class="o">}</span>
</code></pre></div></div>
<p>Catching <a href="https://docs.oracle.com/javase/7/docs/api/java/lang/Exception.html"><code class="language-plaintext highlighter-rouge">Exception</code></a> is generally a bad idea. The <code class="language-plaintext highlighter-rouge">catch</code> block in the example above will catch all exceptions that are thrown in the <code class="language-plaintext highlighter-rouge">try</code> body including the ones that it cannot handle. This prevents upper methods in the stack from handling the exception properly.</p>
<p>The only catch is that the exceptions in multi-catch must be disjoint. See this <a href="http://stackoverflow.com/questions/8393004/in-a-java-7-multicatch-block-what-is-the-type-of-the-caught-exception">Stack Overflow answer</a> for more details.</p>
<h2 id="-3-underscores-in-numeric-literals"><a name="3"></a> 3. Underscores in Numeric Literals</h2>
<p>Starting from Java 7, you can use underscores in numeric literals to make your code more readable. The underscores can appear <strong>anywhere between the digits</strong>, except at the start or the end of a literal. Here are some examples taken from <a href="https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html">Oracle’s tutorial</a>:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">long</span> <span class="n">creditCardNumber</span> <span class="o">=</span> <span class="mi">1234_5678_9012_3456L</span><span class="o">;</span>
<span class="kt">long</span> <span class="n">socialSecurityNumber</span> <span class="o">=</span> <span class="mi">999_99_9999L</span><span class="o">;</span>
<span class="kt">long</span> <span class="n">phoneNumber</span> <span class="o">=</span> <span class="mi">123_456_7890L</span><span class="o">;</span>
<span class="kt">long</span> <span class="n">data</span> <span class="o">=</span> <span class="mb">0b11010010_01101001_10010100_10010010</span><span class="o">;</span>
<span class="kt">byte</span> <span class="n">nybbles</span> <span class="o">=</span> <span class="mb">0b0010_0101</span><span class="o">;</span>
<span class="kt">long</span> <span class="n">maxLong</span> <span class="o">=</span> <span class="mh">0x7fff_ffff_ffff_ffff</span><span class="no">L</span><span class="o">;</span>
<span class="kt">int</span> <span class="n">x6</span> <span class="o">=</span> <span class="mh">0x5_2</span><span class="o">;</span>
</code></pre></div></div>
<p>Although I’m not sure it would ever make sense to store credit card or social security numbers in your code, <strong>using underscores certainly makes hex and binary literals a lot easier to read.</strong></p>
<h2 id="-4-default-methods-in-interface"><a name="4"></a> 4. Default Methods (in Interface)</h2>
<p>Since Java 8, you can include <strong>method bodies <a href="https://docs.oracle.com/javase/tutorial/java/IandI/defaultmethods.html">in interfaces</a></strong>, which wasn’t allowed in previous versions of Java. These methods are known as <strong>default methods</strong> because they are <strong>automatically</strong> included in classes that implement the interface. Default methods were <strong>added primarily for backwards-compatibility reasons and you should <a href="https://zeroturnaround.com/rebellabs/how-your-addiction-to-java-8-default-methods-may-make-pandas-sad-and-your-teammates-angry/">use them judiciously</a></strong>.</p>
<p>This blog has a good <a href="http://zeroturnaround.com/rebellabs/java-8-explained-default-methods/">overview of the subject</a>.</p>
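<p>A minimal sketch of the idea (the <code class="language-plaintext highlighter-rouge">Greeter</code> interface and names below are hypothetical): the interface ships a <code class="language-plaintext highlighter-rouge">default</code> method, so implementing classes get it for free:</p>

```java
// A default method carries its implementation in the interface itself, so
// existing classes that implement Greeter keep compiling without changes.
interface Greeter {
    String name();

    // Inherited automatically by every implementing class
    default String greet() {
        return "Hello, " + name() + "!";
    }
}

class ConsoleGreeter implements Greeter {
    @Override
    public String name() {
        return "Java";
    }
    // No greet() here -- the default implementation from Greeter is used
}
```

<p>This is exactly how Java 8 added methods like <code class="language-plaintext highlighter-rouge">forEach</code> to existing collection interfaces without breaking every class that implemented them.</p>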
<h2 id="5-parallel-sorting-of-large-arrays"><a name="5"></a>5. Parallel Sorting of Large Arrays</h2>
<p>This one’s my favorite. It allows large arrays to be <strong>sorted faster</strong>. It works by dividing a large array into several smaller subarrays, sorting each subarray in parallel, and merging the results back together to form the sorted array. So instead of,</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">Array</span><span class="o">.</span><span class="na">sort</span><span class="o">(</span><span class="n">someArray</span><span class="o">);</span> <span class="c1">// Sorts arrays sequentially</span>
</code></pre></div></div>
<p>Starting with Java 8, you could use:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">Arrays</span><span class="o">.</span><span class="na">parallelSort</span><span class="o">(</span><span class="n">someArray</span><span class="o">);</span> <span class="c1">// Parallel Sorting</span>
</code></pre></div></div>
<p>The analysis done for this <a href="http://www.drdobbs.com/jvm/parallel-array-operations-in-java-8/240166287">Dr. Dobb’s article</a> showed a 4x performance improvement:</p>
<blockquote>
<p>I loaded the raw integer data from an image into an array, which ended up at <strong>46,083,360 bytes in size</strong> (this will vary depending on the image you use). The serial sort method took almost 3,000 milliseconds to sort the array on my quad-core laptop, while the parallel sort methods took a maximum of about 700 milliseconds. It’s not often that a new language release <strong>updates the performance of a class by a factor of 4x</strong>.</p>
</blockquote>
<p>4x!? I’d be very pleased with anything over 2x.</p>
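<p>If you want a rough, unscientific comparison on your own machine, a sketch like the one below works (the array size is arbitrary, and the speedup depends heavily on your core count and data; don’t treat this as a proper benchmark):</p>

```java
import java.util.Arrays;
import java.util.Random;

public class SortComparison {
    // Sorts a copy of the input both ways; returns {serialNanos, parallelNanos}.
    public static long[] compare(int[] data) {
        int[] a = Arrays.copyOf(data, data.length);
        int[] b = Arrays.copyOf(data, data.length);

        long t0 = System.nanoTime();
        Arrays.sort(a);                // sequential dual-pivot quicksort
        long serial = System.nanoTime() - t0;

        t0 = System.nanoTime();
        Arrays.parallelSort(b);        // fork/join parallel sort (Java 8+)
        long parallel = System.nanoTime() - t0;

        return new long[] { serial, parallel };
    }

    public static void main(String[] args) {
        int[] data = new Random(42).ints(10_000_000).toArray();
        long[] nanos = compare(data);
        System.out.printf("serial: %d ms, parallel: %d ms%n",
                nanos[0] / 1_000_000, nanos[1] / 1_000_000);
    }
}
```

<p>Note that for small arrays <code class="language-plaintext highlighter-rouge">parallelSort</code> falls back to the sequential algorithm, so the benefit only shows up with large inputs.</p>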
<h2 id="6-optional-return-values"><a name="6"></a>6. Optional Return Values</h2>
<p>Java 8 has introduced a container object called <a href="https://docs.oracle.com/javase/8/docs/api/java/util/Optional.html"><code class="language-plaintext highlighter-rouge">Optional</code></a> for wrapping objects that may not be present or <code class="language-plaintext highlighter-rouge">null</code>. A method can wrap its return type in <code class="language-plaintext highlighter-rouge">Optional</code>. Let’s look at an example:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="kd">static</span> <span class="nc">Optional</span><span class="o"><</span><span class="nc">User</span><span class="o">></span> <span class="nf">getUser</span><span class="o">(</span><span class="nc">String</span> <span class="n">id</span><span class="o">)</span> <span class="o">{</span>
<span class="c1">// If the user id is NOT FOUND, return null</span>
<span class="k">return</span> <span class="kc">null</span><span class="o">;</span>
<span class="o">}</span>
</code></pre></div></div>
<p>And here’s how to call the method:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">Optional</span><span class="o"><</span><span class="nc">User</span><span class="o">></span> <span class="n">optional</span> <span class="o">=</span> <span class="n">getUser</span><span class="o">(</span><span class="s">"jhal1"</span><span class="o">);</span>
<span class="k">if</span> <span class="o">(</span><span class="n">optional</span><span class="o">.</span><span class="na">isPresent</span><span class="o">())</span> <span class="o">{</span>
<span class="c1">// User found. Get the value</span>
<span class="nc">User</span> <span class="n">user</span> <span class="o">=</span> <span class="n">optional</span><span class="o">.</span><span class="na">get</span><span class="o">();</span>
<span class="o">}</span> <span class="k">else</span> <span class="o">{</span>
<span class="c1">// No user found</span>
<span class="o">}</span>
</code></pre></div></div>
<p>In case you are wondering what’s wrong with just returning <code class="language-plaintext highlighter-rouge">null</code>: callers aren’t always aware that a method may return <code class="language-plaintext highlighter-rouge">null</code> and don’t always check for it. This happens frequently and is one of the reasons why <a href="https://docs.oracle.com/javase/7/docs/api/java/lang/NullPointerException.html"><code class="language-plaintext highlighter-rouge">NullPointerException</code></a>s are so plentiful. So, <a href="http://www.oracle.com/technetwork/articles/java/java8-optional-2175753.html">if you are tired of null pointer exceptions</a>, use <code class="language-plaintext highlighter-rouge">Optional</code>.</p>
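<p><code class="language-plaintext highlighter-rouge">Optional</code> also offers methods like <code class="language-plaintext highlighter-rouge">orElse</code> and <code class="language-plaintext highlighter-rouge">map</code> that avoid the <code class="language-plaintext highlighter-rouge">isPresent()</code>/<code class="language-plaintext highlighter-rouge">get()</code> dance altogether. A small sketch (<code class="language-plaintext highlighter-rouge">findNickname</code> is a made-up lookup, not a real API):</p>

```java
import java.util.Optional;

public class OptionalDemo {
    // Hypothetical lookup: returns a value only for an id we know about
    static Optional<String> findNickname(String id) {
        return "jhal1".equals(id) ? Optional.of("Hal") : Optional.empty();
    }

    public static void main(String[] args) {
        // orElse() supplies a fallback value when the Optional is empty
        String name = findNickname("nobody").orElse("anonymous");
        System.out.println(name); // prints "anonymous"

        // map() transforms the value only if one is present
        int len = findNickname("jhal1").map(String::length).orElse(0);
        System.out.println(len); // prints 3
    }
}
```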
<h2 id="7-strings-in-switch-statements"><a name="7"></a>7. Strings in <code class="language-plaintext highlighter-rouge">switch</code> statements</h2>
<p>I almost forgot that Java has a <code class="language-plaintext highlighter-rouge">switch-case</code> statement. Prior to Java 7, the <code class="language-plaintext highlighter-rouge">switch</code> statement only worked with integral types smaller than <code class="language-plaintext highlighter-rouge">long</code> (and their wrapper classes) and <code class="language-plaintext highlighter-rouge">enum</code>s. Java 7 introduced the ability to use a <code class="language-plaintext highlighter-rouge">String</code> object as the expression. Here’s how it looks:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">String</span> <span class="n">car</span> <span class="o">=</span> <span class="n">getCar</span><span class="o">();</span>
<span class="k">switch</span><span class="o">(</span><span class="n">car</span><span class="o">)</span> <span class="o">{</span>
<span class="k">case</span> <span class="s">"Corvette"</span><span class="o">:</span>
<span class="c1">//Handle Corvette</span>
<span class="k">break</span><span class="o">;</span>
<span class="k">case</span> <span class="s">"AC Cobra"</span><span class="o">:</span>
<span class="c1">//Handle AC Cobra</span>
<span class="k">break</span><span class="o">;</span>
<span class="k">case</span> <span class="s">"McLaren F1"</span><span class="o">:</span>
<span class="c1">//Handle McLaren F1</span>
<span class="k">break</span><span class="o">;</span>
<span class="k">default</span><span class="o">:</span>
<span class="c1">//Handle "Car not Found" error</span>
<span class="k">break</span><span class="o">;</span>
<span class="o">}</span>
</code></pre></div></div>
<p>Which is equivalent to the more cluttered:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="o">(</span><span class="n">car</span><span class="o">.</span><span class="na">equals</span><span class="o">(</span><span class="s">"Corvette"</span><span class="o">))</span> <span class="o">{</span>
<span class="c1">//Handle Corvette</span>
<span class="o">}</span> <span class="k">else</span> <span class="k">if</span> <span class="o">(</span><span class="n">car</span><span class="o">.</span><span class="na">equals</span><span class="o">(</span><span class="s">"AC Cobra"</span><span class="o">))</span> <span class="o">{</span>
<span class="c1">//Handle AC Cobra</span>
<span class="o">}</span> <span class="k">else</span> <span class="k">if</span> <span class="o">(</span><span class="n">car</span><span class="o">.</span><span class="na">equals</span><span class="o">(</span><span class="s">"McLaren F1"</span><span class="o">))</span> <span class="o">{</span>
<span class="c1">//Handle McLaren F1</span>
<span class="o">}</span> <span class="k">else</span> <span class="o">{</span>
<span class="c1">//Handle "Car not Found" error</span>
<span class="o">}</span>
</code></pre></div></div>
<h2 id="8-the-diamond-operator-"><a name="8"></a>8. The Diamond Operator <code class="language-plaintext highlighter-rouge"><></code></h2>
<p>The diamond operator was introduced to make the use of generics a little less verbose. Take a look at this example:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">Map</span><span class="o"><</span><span class="nc">String</span><span class="o">,</span> <span class="nc">List</span><span class="o"><</span><span class="nc">String</span><span class="o">>></span> <span class="n">aMap</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">HashMap</span><span class="o"><</span><span class="nc">String</span><span class="o">,</span> <span class="nc">List</span><span class="o"><</span><span class="nc">String</span><span class="o">>>();</span>
</code></pre></div></div>
<p>The parameter types are duplicated on the left and right sides of the expression. Since Java 7, you can omit the type definitions on the right side of assignment expressions with a diamond operator, <code class="language-plaintext highlighter-rouge"><></code>. The above statement could be rewritten as:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">Map</span><span class="o"><</span><span class="nc">String</span><span class="o">,</span> <span class="nc">List</span><span class="o"><</span><span class="nc">String</span><span class="o">>></span> <span class="n">aMap</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">HashMap</span><span class="o"><>();</span>
</code></pre></div></div>
<p>When the compiler encounters the diamond operator (<>), it <strong>infers the generic type arguments from the context</strong>.</p>
<h2 id="9-annotations-everywhere"><a name="9"></a>9. Annotations Everywhere</h2>
<p>Thanks to Java 8, annotations can be retrofitted <a href="http://docs.oracle.com/javase/tutorial/java/annotations/basics.html">almost anywhere</a> in your code. Great, because that’s <a href="http://www.annotatiomania.com/">just what we needed</a>. I’m fine with annotations, but I’ve seen some developers go overboard, trying to do <strong>too much magic</strong> with them. Too many annotations, just like too much of anything, are bad. You’re probably better off not knowing that this feature even exists. End of my rant.</p>
<h2 id="10-varargs"><a name="10"></a>10. Varargs</h2>
<p><a href="http://docs.oracle.com/javase/1.5.0/docs/guide/language/varargs.html">Varargs</a> are useful for passing an arbitrary number of parameters to a method. Such as <a href="http://docs.oracle.com/javase/8/docs/api/java/lang/String.html#format-java.lang.String-java.lang.Object...-"><code class="language-plaintext highlighter-rouge">String.format(String format, Object...args)</code></a>. Joshua Bloch in <a href="http://www.amazon.com/Effective-Java-2nd-Joshua-Bloch/dp/0321356683">Effective Java</a> recommends using varargs judiciously:</p>
<blockquote>
<p>varargs are effective in circumstances where you really do want a method with a variable number of arguments. Varargs were designed for printf, which was added to the platform in release 1.5, and for the core reflection facility (Item 53), which was retrofitted to take advantage of varargs in that release. Both printf and reflection benefit enormously from varargs. You can retrofit an existing method that takes an array as its final parameter to take a varargs parameter instead with no effect on existing clients. <strong>But just because you can doesn’t mean that you should!</strong></p>
</blockquote>
<p>That’s sound advice. Just because a feature exists doesn’t mean you have to use it.</p>
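<p>For completeness, here is what declaring your own varargs method looks like (the <code class="language-plaintext highlighter-rouge">sum</code> example is mine, not from the post). Inside the method, the <code class="language-plaintext highlighter-rouge">int...</code> parameter is just an <code class="language-plaintext highlighter-rouge">int[]</code>:</p>

```java
public class VarargsDemo {
    // Callers may pass any number of int arguments -- including none at all.
    // Inside the method, the varargs parameter behaves like a plain int[].
    static int sum(int... numbers) {
        int total = 0;
        for (int n : numbers) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum());        // prints 0
        System.out.println(sum(1, 2, 3)); // prints 6
    }
}
```

<p>An existing method that took an <code class="language-plaintext highlighter-rouge">int[]</code> as its final parameter could be retrofitted to this signature without breaking its callers, which is exactly the point Bloch makes above.</p>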
The char Type in Java is Broken?2016-05-08T00:00:00+00:00https://codeahoy.com/2016/05/08/the-char-type-in-java-is-broken<p>If I may be so brash, it is my opinion that the <code class="language-plaintext highlighter-rouge">char</code> type in Java is dangerous and should be avoided if you are going to use Unicode characters. <code class="language-plaintext highlighter-rouge">char</code> is used for representing characters (e.g. ‘a’, ‘b’, ‘c’) and has been supported in Java since it was released about 20 years ago. When Java first came out, the world was a simpler place. Windows 95 was the latest, greatest operating system, the world’s <a href="https://en.wikipedia.org/wiki/Motorola_StarTAC">first flip phone</a> had just gone on sale, and Unicode had fewer than <em>40,000</em> characters, all of which fit perfectly into the 16-bit space that <code class="language-plaintext highlighter-rouge">char</code> provides. But things have changed drastically. Unicode has outgrown the 16-bit space and now requires 21 bits for all of its <em>120,737</em> characters.</p>
<!--more-->
<p>Java has supported Unicode since its first release and <strong>strings are internally represented using <a href="https://en.wikipedia.org/wiki/UTF-16">UTF-16</a> encoding</strong>. UTF-16 is a <em>variable length</em> encoding scheme. For characters that fit into the 16-bit space, it uses 2 bytes; for all other characters, it uses 4 bytes. This is great: all possible Unicode characters in existence, plus about a million more, can be represented using UTF-16 and thus as strings in Java.</p>
<p>But <code class="language-plaintext highlighter-rouge">char</code> is a different story altogether. Let’s look at its definition from the <a href="https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html">official source</a>:</p>
<blockquote>
<p>char: The char data type is a single <strong>16-bit Unicode character</strong>. It has a minimum value of ‘\u0000’ (or 0) and a maximum value of ‘\uffff’ (or 65,535 inclusive).</p>
</blockquote>
<p><em>“16-bit Unicode character”?</em> I guess <a href="http://www.joelonsoftware.com/articles/Unicode.html" rel="nofollow">Joel</a> was right:</p>
<blockquote>
<p><strong>Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct</strong>. It is the single most common myth about Unicode, so if you thought that, don’t feel bad.</p>
</blockquote>
<p>There is no such thing as “16-bit Unicode character”. Please read <a href="http://www.joelonsoftware.com/articles/Unicode.html" rel="nofollow">Joel’s article</a> if you don’t understand the last statement.</p>
<p><code class="language-plaintext highlighter-rouge">char</code> uses 16 bits to store Unicode characters that fall in the 0 - 65,535 which isn’t enough to store all Unicode characters anymore. You might think: <em>Gee, 65,535 is plenty already. I’ll never use that many</em>. That’s true. But your users will. And when they send you a character that requires more than 16 bits, like these emojis 👦👩, <strong>the <code class="language-plaintext highlighter-rouge">char</code> methods like <code class="language-plaintext highlighter-rouge">someString.charAt(0)</code> or <code class="language-plaintext highlighter-rouge">someString.substring(0,1)</code> will break and give you only half the code point. And the worst part is that the compiler won’t even complain</strong>. Recently, a fellow developer told me that their “North American users” started complaining that the chat nicknames and messages “aren’t displaying properly”. After a lot of grief, they found the issue and had to undo all <code class="language-plaintext highlighter-rouge">char</code> manipulation in their software to handle emojis and other cool characters. (Use <a href="https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#codePointAt(int)"><code class="language-plaintext highlighter-rouge">codePointAt(index)</code></a> instead which returns an int that will fit all Unicode characters in existence.)</p>
<p>I have heard people say things like: <em>“if internationalization isn’t a concern, you’d probably be fine using <code class="language-plaintext highlighter-rouge">char</code>”</em> or <em>“don’t worry about it unless your program is going to be released in China or Japan”</em>.</p>
<p>First, I rarely come across applications where internationalization isn’t a concern anymore. My last three jobs all required internationalization at their core. Second, <strong>emoji characters are supported by all popular applications these days</strong>. Unicode isn’t just about internationalization anymore.</p>
<p>To be fair to <code class="language-plaintext highlighter-rouge">char</code>, it will work fine most of the time for many applications. It isn’t broken, but it has a flaw that could ‘break’ your application silently and make your users see garbled text. Maybe a new character type from Oracle that can hold any Unicode code point is the answer. Or, in the interim, at least a runtime warning when a surrogate pair is about to be split in half. Until then, we should probably avoid the <code class="language-plaintext highlighter-rouge">char</code> type.</p>
<p>Even its official <a href="https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html">JavaDocs</a> don’t sound all that convincing to me:</p>
<blockquote>
<p>The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. <strong>The Unicode Standard has since been changed to allow for characters whose representation requires more than 16 bits</strong>. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value. (Refer to the definition of the U+n notation in the Unicode Standard.)</p>
</blockquote>
<p>🤷</p>
Minimum Viable Product - Lessons for Software Teams2016-05-07T00:00:00+00:00https://codeahoy.com/2016/05/07/minimum-viable-product-lessons-for-software-teams<p>The concept of the minimum viable product or the MVP was popularized by Eric Ries in his book <a href="http://www.amazon.com/Lean-Startup-Entrepreneurs-Continuous-Innovation/dp/0307887898">The Lean Startup</a>. He defines it as:</p>
<blockquote>
<p>The minimum viable product is that version of a new product which allows a team to collect the maximum amount of validated learning about customers with the least effort.</p>
</blockquote>
<p>Startups fail for many reasons but one of the <strong>biggest reasons is that they build products that their customers do not want</strong>. By that time, it’s too late. They have already spent months or even years of their lives building a great product - a grand vision - that the market isn’t, unfortunately, willing to purchase. MVP is a strategy to avoid exactly that scenario: <strong>building something that the customers do not want</strong>.</p>
<!--more-->
<p>Eric <a href="http://www.inc.com/lee-clifford-julie-schlosser/lean-startup-eric-ries-testing-your-product.html">cites</a> the story of Zappos. They didn’t start out with a grand vision of building a cool website, distribution and call centers. Nope. <strong>Zappos started by testing a simple hypothesis: are customers willing to buy shoes online?</strong></p>
<blockquote>
<p><strong>They went to local shoe store, took pictures of each of their products and put them online. If anyone bought shoes from them [at this early stage], they planned to go to the store, buy the shoes and mail them to the customer</strong>. There was no big business behind it; there was a website and a hope that they’ll get so many orders that it will get annoying to do all the purchasing and shipping manually. <strong>It was all to test their big idea</strong>.</p>
</blockquote>
<p>MVP is a great way to test the actual usage and assumptions as opposed to conventional market research that includes researching online, surveys etc. which often provide misleading results.</p>
<p><strong>Software teams can learn a lot from MVP and apply it to build products that meet their client needs</strong>. When I first heard of MVP, I thought of it as a <strong>rebranding of ‘Continuous Delivery (CD)’</strong> which, along with the practices of incremental and iterative development, has been around for a long time. But there is a huge difference and it lies in how software developers understand and perceive these concepts. <a href="http://istqbexamcertification.com/what-is-incremental-model-advantages-disadvantages-and-when-to-use-it/">This article</a> captures the definition I hear from most developers when I ask them about incremental development:</p>
<blockquote>
<p>In incremental model the whole requirement is divided into various builds … more easily managed modules. Each module passes through the requirements design, implementation and testing phases. A working version of software is produced during the first module, <strong>so you have working software early on during the software life cycle</strong>. Each subsequent release of the module adds function to the previous release. The process continues till the complete system is achieved.</p>
</blockquote>
<p>The article even has an illustration showing how the Mona Lisa would be built incrementally or piece-by-piece:</p>
<p><img src="https://codeahoy.com/img/Incremental-mona-lisa.jpg" alt="Incremental Mona Lisa.jpg" /></p>
<p>The problem is that <em>“working software early on in the life cycle”</em> doesn’t happen in real life. Modules take a long time; developers argue about technologies and methodologies, various architectures and database systems are evaluated, and the complete painting takes a long time. When it finally gets released, the customer turns around and demands that Mona Lisa’s dress be red instead of deep forest green!</p>
<p>Traditional methodologies guide the software development process to ensure that the <strong>product gets built right</strong>, but the MVP answers the bigger question: <strong>are we building the right product?</strong> While Agile does promote <em>evolutionary development</em> where results are <a href="https://en.wikipedia.org/wiki/Agile_software_development">demonstrated to stakeholders after each iteration</a>, an MVP isn’t just show and tell. It is a real product that gets released to users. Software teams can build an MVP using Agile/Scrum, <em>incrementally</em>, and deliver it <em>continuously</em>.</p>
<p>Because an MVP gets released to customers, it requires support from the organization, clients, and other departments. Building an MVP isn’t easy. On the one hand, you want to limit it to a few essential or key features; on the other hand, you want to deliver something your customers will find useful so they can start using it. You must find the right balance.</p>
<p>Releasing the product only when it is almost or absolutely ready is a mistake I have seen far too often. Usual software development goes like this: software teams talk to their clients and stakeholders to get their requirements (<em>Everyone does that. Right?</em>). They create a product vision (or write functional specifications) and show it to the stakeholders. Once the stakeholders agree on the requirements and the product vision, the team sets out to build the product. The product is divided into several layers, or several sub-systems if following an SOA or microservices architecture. Each layer is built piece-by-piece, incrementally and iteratively. Even the CI/CD pipeline is there from week one. Finally, when everything is ready, the product is ‘released’ to the customer. Then the customer <strong>flips and asks for changes or new features</strong>. It happens almost always. Depending on how well the team captured and modeled the requirements, the rework could take weeks or months!</p>
<p>I made this mistake when I architected a telecommunications project using Service Oriented Architecture (SOA). While the team followed an iterative approach for individual services and did continuous integration, <strong>we didn’t implement, early on, a front-end that our customer could use and provide feedback on</strong>. We used Agile and it worked great for the database layer, the business layer, the sniffers, the messaging layers and the load balancers. When the product was complete, we released it to our big client for a pilot with a few thousand users. That same week, I received a phone call from the CEO asking, in an apologetic tone, for very different functionality. It turned out that <strong>once the client used the product, they changed their mind</strong>. The big release date was one or two months away and we had to do major rework to accommodate the change.</p>
<p>What can we learn from this story? Build the MVP. While I’m not against refactoring and adapting to change, in this case we could have avoided the extra effort if we had built an MVP. We didn’t build in a vacuum. We prototyped, but <strong>prototypes only help so much</strong>. In the example I gave, the product’s goal was to support mobile menus that allowed people to send messages and transfer funds. We prototyped and mocked the menus in <a href="https://balsamiq.com/">Balsamiq</a> and even went as far as creating a simple HTML page that allowed the client to interact with the menu options. At regular intervals, we’d send the client videos of partial functionality in action. But <strong>prototypes and videos aren’t remotely as interesting as the actual product. Clients looked at those and said everything looks great. But once they got the actual product on their phones, they got creative</strong> and came up with new ideas. If I were doing the project all over again, the first thing I’d do is build the front-end (menus), use a simple database (even SQLite instead of Cassandra), make the product work on real devices, and give it to the client. The plumbing to make it work across multiple sites and NoSQL databases would come later.</p>
<p>A little while ago, I was working on a backend with a REST-based microservices architecture. There were several clients with different needs and there were a lot of unknowns. A common complaint from the developers who worked on earlier projects was that <strong>the clients always change their minds after the release, which makes their products messy since they accrue <a href="http://codeahoy.com/2016/04/27/do-not-let-technical-debt-get-out-of-control/">technical debt</a></strong>.
I had déjà vu. So we figured out the most urgent needs and defined a milestone that would deliver a working product. It would have fewer services than our grand vision, but everything would work end-to-end. It wouldn’t be perfect or complete, but it would have just the right functionality the clients needed to get up and running and give us feedback. And we would grow and evolve it over time.</p>
<p>The closest reference to MVP I could find in software development is a concept called ‘Tracer Bullets’, first used by Andy Hunt and Dave Thomas in their book <a href="http://www.amazon.com/Pragmatic-Programmer-Journeyman-Master/dp/020161622X">The Pragmatic Programmer</a>:</p>
<blockquote>
<p>We once undertook a complex client-server database marketing project … The servers were a
range of relational and specialized databases. The client GUI, written in Object Pascal,
used a set of C libraries to provide an interface to the servers … <strong>There were many unknowns and many different environments</strong>, and no one was too sure how the GUI should behave.</p>
<p>This was a great opportunity to use tracer code. We developed the framework for the front
end, libraries for representing the queries, and a structure for converting a stored query into
a database-specific query. Then we put it all together and checked that it worked. For that
initial build, all we could do was submit a query that listed all the rows in a table, but it
proved that the UI could talk to the libraries, the libraries could serialize and unserialize a
query, and the server could generate SQL from the result. <strong>Over the following months we
gradually fleshed out this basic structure, adding new functionality by augmenting each
component of the tracer code in parallel</strong>.</p>
</blockquote>
<p>So when you are setting out on a journey to build a new product and have lots of unknowns and assumptions, build an MVP - or use Tracer Bullets. Make sure to <strong>build the right thing</strong>. You might have seen this image a thousand times already, but this is exactly what an MVP will help you avoid:</p>
<p><img src="https://codeahoy.com/img/software-requirements.png" alt="Software Requirements" /></p>
<h2 id="notes">Notes</h2>
<p>John Mayo-Smith <a href="http://www.informationweek.com/two-ways-to-build-a-pyramid/d/d-id/1012280?">illustrates</a> two approaches to building a pyramid:</p>
<ol>
<li>
<p>Build it layer by layer
<img src="https://codeahoy.com/img/flat_pyramid.gif" alt="Flat Pyramid" /></p>
</li>
<li>
<p>Start with a smaller pyramid and keep growing it.
This is just like an MVP: deliver a “smaller product” that captures the essential features, one the customer can actually use and give feedback on.</p>
</li>
</ol>
<p><img src="https://codeahoy.com/img/growing_pyramid.gif" alt="Growing Pyramid" /></p>
Good Abstractions Have Fewer Leaks2016-05-06T00:00:00+00:00https://codeahoy.com/2016/05/06/good-abstractions-have-fewer-leaks<blockquote>
<p>Abstraction is one of the greatest visionary tools ever invented by human beings to imagine, decipher, and depict the world. - Jerry Saltz</p>
</blockquote>
<p>Abstractions are all around us. We abstract things to hide details, making it easier to see the “big picture” and helping us cope with complexity. When I push down on the gas pedal, a lot of magic happens under the hood to move my car. But I don’t have to know any of that. I only ever need to know that pushing the pedal increases the speed and releasing it decreases it. The gas pedal is a <strong>simple interface and that’s what makes it so great</strong>. It has been with us for a very long time and it’s here to <a href="https://forums.teslamotors.com/forum/forums/what-term-should-we-use-gas-pedal">stay</a>.</p>
<!--more-->
<p>In software, like any other engineering discipline, abstractions are everywhere: the protocols, the frameworks, the libraries, the game engines, the file systems, even the programming languages we use every day are abstractions of <a href="https://en.wikipedia.org/wiki/Low-level_programming_language">low-level languages</a>. It can be argued that everything in computer science is inevitably an abstraction of something more complicated under the hood. The higher-level abstractions provide useful functionality in the form of a black box, <strong>hiding all the complexity and implementation details</strong>.</p>
<p>Except that it’s not always so black and white. <strong>Software abstractions are never perfect</strong> and, arguably, are far from it. Abstractions are deficient when you have to <strong>understand the low-level details or peek under the covers to understand what’s going on</strong>. Consider Java’s garbage collection process. It would have been great if I never had to think about how it works - after all, it was supposed to work like magic. But in reality, when a bug was reported that <strong>“requests are taking a lot longer to process”</strong>, I had to go in and figure out that garbage collection was slowing things down. One cannot optimize performance and avoid penalties without learning how the garbage collector works (at least not for high-throughput, low-latency applications). Joel Spolsky called this the <a href="http://www.joelonsoftware.com/articles/LeakyAbstractions.html">law of leaky abstractions</a>:</p>
<blockquote>
<p>All non-trivial abstractions, to some degree, are leaky.</p>
<p>Abstractions fail. Sometimes a little, sometimes a lot. There’s leakage. Things go wrong. It happens all over the place when you have abstractions.</p>
</blockquote>
<p><img src="https://codeahoy.com/img/leaky_abstraction.jpg" alt="Leaky Pipe" /></p>
<p>Joel is right. All abstractions in software are leaky: they attempt to hide away the complexity, but the underlying details are never completely hidden. However, <strong>the truth is that we can’t live without abstractions.</strong> Without abstractions and modules, I can’t even begin to comprehend how any large software project could be maintained. Without Java’s garbage collection, my code would be cluttered with <code class="language-plaintext highlighter-rouge">delete</code> statements and memory leaks would be all over the place. The flexibility offered by the garbage collector works 90% of the time. It leaks because the underlying function of automatic memory management is very complex and highly irregular, and there isn’t a general-purpose solution for it. To its credit, garbage collection provides ways to tweak performance depending on specific needs.</p>
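<p>Java’s collector is its own beast, but the leak is easy to demonstrate in any garbage-collected runtime. As a rough illustration (Python’s cycle collector here, not the JVM), the “invisible” memory manager does measurable work that can surface as pause time:</p>

```python
import gc
import time

# Rough illustration (Python's cycle collector, not the JVM's GC):
# the memory manager we never "see" still does measurable work.
class Node:
    def __init__(self):
        self.ref = None

def make_garbage(n):
    for _ in range(n):
        a, b = Node(), Node()
        a.ref, b.ref = b, a  # reference cycle: only the cycle collector frees it

gc.disable()                 # let cyclic garbage pile up, as between GC runs
make_garbage(100_000)
start = time.perf_counter()
collected = gc.collect()     # the abstraction "leaks" as a visible pause
pause_ms = (time.perf_counter() - start) * 1000
gc.enable()

print(f"collected {collected} objects in {pause_ms:.1f} ms")
```

<p>That pause is exactly the kind of detail a latency-sensitive service ends up having to understand, whatever the language.</p>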
<p>I don’t think Joel meant that we should abandon abstractions altogether. I guess the point was to embrace the fact that all abstractions are leaky. <strong>Edge cases will often require developers to get their hands dirty by understanding the framework to optimize performance or to troubleshoot. And that’s perfectly fine.</strong> Good abstractions and frameworks reduce the need to understand their inner workings. Bad abstractions leak a lot and expose their users to all the details. When I heard of Remote Procedure Call (RPC), I loved the concept. It’d allow me to call a method on a remote server as easily as I would a local method. Great. There was a learning curve with <a href="https://thrift.apache.org/">Apache Thrift</a> and it was well worth it. The framework, <a href="http://www.valuedlessons.com/2008/06/my-experience-with-message-passing.html">albeit leaky</a>, saved me from writing my own client/server and dealing with the communication complexity (although it was later replaced with REST). Similarly, compared to the C++ string library, which Joel cites as being leaky, strings in Java are a lot less leaky (probably because they are a native feature of the language).</p>
<p>Leaky abstractions aside, over the years I have seen both good and terrible abstractions masquerading as APIs. Not too long ago, I had to fix an “abstraction” <strong>that failed to hide the functionality</strong> it provided and had very tight coupling with the applications that used it. 30% of the module’s logic was scattered outside of it, weaved throughout other applications using Java annotations. Another time, I was reviewing some code and came across some very complex logic. It turned out that a developer had created an “abstraction” over Hibernate to provide an even more general, and completely useless, solution. Hibernate is already an abstraction over SQL! <strong>Adding another layer on top of Hibernate made things even more complex than they needed to be</strong>. As <a href="https://en.wikipedia.org/wiki/David_Wheeler_(British_computer_scientist)">David J. Wheeler</a> is quoted:</p>
<blockquote>
<p>All problems in computer science can be solved by another level of indirection, except for the problem of too many layers of indirection.</p>
</blockquote>
<p>Abstractions are all around us and <strong>without abstractions we’d be doomed.</strong> After all, <a href="https://en.wikiquote.org/wiki/Carl_Sagan">if you wish to make an apple pie from scratch, you’d first have to invent the universe</a>. Abstractions give us the flexibility to work with something very complicated. Imagine the complexity of dealing with blobs of bits and bytes on magnetic platters, or on semiconductor chips, instead of files and databases! Understanding what goes on under the hood from time to time is a good trade-off considering that, most of the time, the flexibility offered by a good abstraction is sufficient.</p>
<p>As software developers, our goal should be to build less leaky abstractions. I also mentioned a few examples of bad and excessive abstractions that complicate things even more and should be avoided at all costs. In the next post or so, I’ll talk about modular software, how abstractions help software development, and when they become a pain.</p>
Software Rot, Entropy and the Broken Window Theory2016-05-02T00:00:00+00:00https://codeahoy.com/2016/05/02/software-rot-entropy-and-the-broken-window-theory<p><img src="https://codeahoy.com/img/broken_windows.jpg" alt="Broken Windows" /></p>
<blockquote>
<p>“Complexity is the business we are in and complexity is what limits us.” - Fred Brooks, The Mythical Man-Month</p>
</blockquote>
<p>Software projects go through many modifications over their lifetime. As they evolve, the code grows in size and complexity creeps in. Software developers spend a large portion of their time <strong>maintaining</strong> existing software, either by adding new functionality or by fixing bugs. Often, they are forced to take <strong>shortcuts</strong> to meet deadlines. Developers add new functionality in a ‘quick and dirty’ manner and apply duct tape to defects. While the organization meets its short-term goal of getting the software out the door quickly, the code quality suffers and deteriorates. After a while, things start to get really bad. The software becomes so complex and buggy that it is virtually impossible to maintain. Fixing a bug would introduce more bugs, and modifying one part of the software would break several others.</p>
<!--more-->
<p>Let’s look at a related concept called <strong>software entropy</strong>. Entropy is the amount of <strong>disorder</strong> in a system. It is a physical phenomenon but <a href="http://www.amazon.com/Object-Oriented-Software-Engineering-Approach/dp/0201544350">Ivar Jacobson et al</a> used it to describe the disorder in a software system:</p>
<blockquote>
<p>The second law of thermodynamics, in principle, states that a closed system’s disorder cannot be reduced, it can only remain unchanged or increase. A measure of this disorder is entropy. This law also seems plausible for software systems; <strong>as a system is modified, its disorder, or [software] entropy, always increases</strong>. This is called Software Entropy.</p>
</blockquote>
<p>When the ‘disorder’ or the <em>software entropy</em> increases, it leads to <strong><a href="https://en.wikipedia.org/wiki/Software_rot">software or code rot</a></strong>. The system ends up becoming so complex and disorganized that it is too costly or impossible to maintain. People get frustrated and consider major refactoring or, in some cases, <a href="http://codeahoy.com/2016/04/21/when-to-rewrite-from-scratch-autopsy-of-a-failed-software/">rewriting from scratch</a>. These arduous solutions fix the problem in the short-term but the software will <strong>rot again if the team doesn’t adopt a plan for keeping future complexity under control</strong>.</p>
<p>While there are many factors that lead to software rot, the most important ones, according to <a href="https://en.wikipedia.org/wiki/Andy_Hunt_(author)">Andrew Hunt</a> and <a href="https://en.wikipedia.org/wiki/Dave_Thomas_(programmer)">Dave Thomas</a>, are the <strong>psychology and the team culture</strong>. In their book, <a href="http://www.amazon.com/Pragmatic-Programmer-Journeyman-Master/dp/020161622X">The Pragmatic Programmer</a>, they argue that <strong>software entropy is contagious and if not controlled, becomes an epidemic</strong>.</p>
<blockquote>
<p>In inner cities, some buildings are beautiful and clean, while others are rotting hulks. Why? Researchers in the field of crime and urban decay discovered a fascinating trigger mechanism, one that very quickly turns a clean, intact, inhabited building into a smashed and abandoned derelict .</p>
<p>A broken window.</p>
<p><strong>One broken window, left unrepaired for any substantial length of time, instills in the inhabitants of the building a sense of abandonment — a sense that the powers that be don’t care about the building. So another window gets broken</strong>. People start littering. Graffiti appears. Serious structural damage begins. In a relatively short space of time, the building becomes damaged beyond the owner’s desire to fix it, and the sense of abandonment becomes reality.</p>
</blockquote>
<p>The broken window theory was proposed by criminologists <a href="https://en.wikipedia.org/wiki/James_Q._Wilson">James Wilson</a> and <a href="https://en.wikipedia.org/wiki/George_L._Kelling">George Kelling</a> and had an enormous impact on police policy throughout the 1990s. While the broken window theory has been <a href="http://www.smithsonianmag.com/smart-news/sorry-malcolm-gladwell-nycs-drop-in-crime-not-due-to-broken-window-theory-12636297/?no-ist">widely</a> <a href="http://chronicle.uchicago.edu/060330/brokenwindow.shtml">criticized</a>, I think it makes sense. When shortcuts are taken and poor design decisions are made, it sends a signal that no one cares or that no one is in charge. Since it’s a feature of the environment, even good software developers <em>might</em> fall for it. The solution, according to Andrew and Dave, is simple:</p>
<blockquote>
<p><strong>Don’t leave “broken windows”</strong> (bad designs, wrong decisions, or poor code) unrepaired. Fix each one as soon as it is discovered. If there is insufficient time to fix it properly, then board it up. <strong>Perhaps you can comment out the offending code, or display a “Not Implemented” message, or substitute dummy data instead. Take some action to prevent further damage and to show that you’re on top of the situation</strong>.</p>
</blockquote>
<p>A simple comment around ugly code stating that it’s ugly and needs to be fixed soon is better than nothing. Code can quickly rot once windows start breaking. Even a mere perception of disorder can result in total chaos. Don’t leave broken windows; fix them as soon as you can. Entropy will creep in - don’t let it win.</p>
Do Experienced Programmers Use Google Frequently?2016-04-30T00:00:00+00:00https://codeahoy.com/2016/04/30/do-experienced-programmers-use-google-frequently<p>Software developers, especially those who are new to the field, often <a href="http://two-wrongs.com/how-much-does-an-experienced-programmer-use-google">ask</a> this question or at least <a href="http://www.hanselman.com/blog/AmIReallyADeveloperOrJustAGoodGoogler.aspx">wonder</a> whether they are good developers or just good at googling up solutions.</p>
<h3 id="do-experienced-programmers-use-google-frequently">“Do experienced programmers use Google frequently?”</h3>
<p>The resounding answer is <strong>YES, experienced (and good) programmers use Google… a lot</strong>. In fact, one might argue they <strong>use it more than beginners</strong>. Using Google doesn’t make them bad programmers or imply that they cannot code without Google. In fact, the truth is quite the opposite: Google is an essential part of their software development toolkit and they know when and how to use it.</p>
<!--more-->
<p>A big reason to use Google is that it is hard to remember all those minor details and nuances especially when you are programming in multiple languages and using dozens of frameworks. As Einstein said:</p>
<blockquote>
<p>“Never memorize something that you can look up.” - Albert Einstein</p>
</blockquote>
<p>Aside from that, good programmers also know that they cannot be the first ones to have encountered a problem. They use Google to <strong>research</strong> possible solutions, carefully evaluating the results and consciously separating the wheat from the chaff; they don’t <strong>blindly follow or copy-paste</strong> any solution they come across. Expert programmers are also paranoid, living in self-doubt and questioning their competence. Whenever their spidey senses start tingling, they know they may be going down the wrong path; they rely on Google to validate their logic.</p>
<p>Going by the definition, I would be considered an experienced programmer. Recently, I had to write a web server in Java using <a href="http://netty.io/">Netty</a> to handle persistent sockets from mobile games. I had never used Netty before. Here are the Google searches I did:</p>
<p class="message">
<strong>1.</strong> netty tutorial
<strong>2.</strong> netty maven dependency
<strong>3.</strong> netty bytebuf to string
<strong>4.</strong> netty bytebuf release
<strong>5.</strong> netty 4 changes
<strong>6.</strong> setOption(“child.bufferFactory”) netty 4<br />
<strong>7.</strong> ByteBuf netty
<strong>8.</strong> opensource projects using netty framework<br />
<strong>9.</strong> netty 4 examples
<strong>10.</strong> netty 4 adding json encoder
<strong>11.</strong> netty channel pipeline
<strong>12.</strong> netty 4 messagetomessage encoder
<strong>13.</strong> netty serverbootstrap childhandler
<strong>14.</strong> ByteBuf netty
<strong>15.</strong> lengthfieldbasedframedecoder netty 4
<strong>16.</strong> netty 4 client examples
<strong>17.</strong> netty 4 bytebuf to bytebuffer
<strong>18.</strong> netty 4 endianness
<strong>19.</strong> netty channelhandlercontext
<strong>20.</strong> netty channelhandlercontext thread safe
<strong>21.</strong> netty user authentication
<strong>22.</strong> netty heartbeat handling
<strong>23.</strong> load test netty with 10k concurrent sockets
</p>
<p>I wrote 255 lines of code that included a working server and a client. I queried Google 23 times, mostly landing on StackOverflow, the Netty 4 website, GitHub, and JavaDocs. If you do the math, that averages out to roughly <strong>1 query every 10 lines of code</strong>! I had no idea. Let me know in the <strong>comments what your average is</strong>.</p>
<p>So sit back, relax and remember that <strong>Google is software developer’s best friend</strong>.</p>
<p><em>How often do you use Google when programming? Do you have any Google power tips that you want to share with others? Just leave a comment below.</em></p>
Do Not Let Technical Debt Get Out of Control2016-04-27T00:00:00+00:00https://codeahoy.com/2016/04/27/do-not-let-technical-debt-get-out-of-control<p>Technical debt is a useful metaphor for describing the consequences of adding new functionality to a system in a <strong>quick and dirty</strong> manner to get something out of the door faster. The proper way would have resulted in a much cleaner design and implementation, but would also have taken much longer. <a href="http://martinfowler.com/bliki/TechnicalDebt.html">Martin Fowler</a> calls technical debt a wonderful metaphor:</p>
<blockquote>
<p>Technical Debt is a <strong>wonderful metaphor</strong> developed by Ward Cunningham to help us think about this problem. In this metaphor, doing things the quick and dirty way sets us up with a technical debt, which is similar to a financial debt. Like a financial debt, the technical debt incurs interest payments, which come in the form of the <strong>extra effort that we have to do in future development because of the quick and dirty design choice</strong>.</p>
</blockquote>
<p>Taking on technical debt should be a <strong>strategic decision</strong> where all stakeholders must understand the consequences and risks involved. Like most financial debts, it should not be taken recklessly and interest <strong>payments must be paid on time to avoid penalties</strong>.</p>
<p><!--more--></p>
<p>While technical debt has negative connotations, it is an <strong>unavoidable reality</strong> for many software projects. In her <a href="http://www.amazon.com/Practical-Object-Oriented-Design-Ruby-Addison-Wesley/dp/0321721330">book</a> on Practical Object-oriented Design in Ruby: An Agile Primer, <a href="http://www.sandimetz.com/">Sandi Metz</a> wrote:</p>
<blockquote>
<p>Sometimes the value of having the feature right now is so great that it outweighs any future increase in costs. If lack of a feature will force you out of business today it doesn’t matter how much it will cost to deal with the code tomorrow; you must do the best you can in the time you have.</p>
</blockquote>
<p>At Starscriber, we accrued technical debt from time to time to take advantage of new business opportunities and tried to pay it off as soon as the dust settled. But we didn’t always succeed. There were at least two projects where the debt got out of control. Implementing (or hacking would be a better word) new features was a complex and <strong>painful process</strong> for everyone involved: developers, testers and operations teams. The change requests didn’t stop coming and we made the mistake of <strong>not acknowledging that we were accumulating way too much technical debt and letting it drag on for way too long</strong>. The result? It became a huge burden and required a lot of effort just to keep the system running.</p>
<p>However, in my opinion the biggest casualty wasn’t the productivity; it was the team culture and the morale. <strong>People got demotivated since they took no pride in their work</strong>. They couldn’t: no creativity or learning was involved, other than finding creative ways to ‘hack’ and make it work. And since the project required <strong>tribal knowledge</strong> to understand all the hacks, we had to keep the ‘demotivated’ team together until we could take it no more and had to <strong>halt new development to do major refactoring and testing</strong>. Some teams consider <a href="http://codeahoy.com/2016/04/21/when-to-rewrite-from-scratch-autopsy-of-a-failed-software/">rewrite from scratch when facing this predicament, which is almost always a big mistake</a>.</p>
<p>The lessons to be learned are:</p>
<h3 id="1-acknowledge">1. Acknowledge</h3>
<p>Technical debt is inevitable and will become a major problem if ignored. We got sidetracked in the pursuit of pleasing a big client and didn’t acknowledge or realize that we were accruing big-time technical debt, which later caused bugs and slowed down our ability to add new features.</p>
<h3 id="2-decide-strategically-understand-short-term-vs-long-term-consequences">2. Decide Strategically: Understand Short-Term vs Long-Term Consequences</h3>
<p>In your career, it will often make sense to ship a subpar system to gain market advantage. In fact, it would be a mistake not to. In our last <a href="http://www.paperistic.com/">startup</a>, we made the opposite mistake and didn’t deliver a <em>minimum viable product</em> on time. As a result, we lost out on crucial early feedback. The <strong>trick is that all stakeholders must understand and strategically prioritize</strong> the tradeoffs of speed vs quality. Non-technical stakeholders often don’t understand the concept and <a href="http://www.ontechnicaldebt.com/blog/steve-mcconnell-on-categorizing-managing-technical-debt/">Steve McConnell</a> has some good advice on how to educate them:</p>
<blockquote>
<p><strong>Technical staff should build on the technical debt metaphor as a way to talk to business staff to explain that if they take on short-term technical debt, they will need to pay it off or else it will end up costing the business on the long-term</strong>. For example: “If we spend X weeks working on this particular infrastructure area, it will allow us to add features A, B, and C. Although the work itself does not show immediate benefit, it will open the door for other work that will produce business benefits later on.” Now this becomes a productive discussion and we have a reason for the business to engage.</p>
</blockquote>
<h3 id="3-dont-let-it-get-out-of-control-pay-the-debt">3. Don’t Let it Get out of Control: Pay the Debt</h3>
<p>The longer you wait, the higher the interest payments get. If you let technical debt accrue and don’t pay it off, future development will stall, code quality will suffer, tribal knowledge will be required to understand all the hacks, and the people working on the project will become demotivated. Have a plan for paying off technical debt to avoid massive interest payments in the future. It should be a part of your normal development process.</p>
<p>Taking on technical debt is risky business: it gives you short-term benefits, but you’ll have to pay the debt back with interest in the future. Interest will keep accruing, and the more you delay paying the debt off, the higher the payments are going to be. Dealing with technical debt is not always easy: <strong>upper management often doesn’t see the value</strong> since it doesn’t result in new features; <strong>things break and you might even have to throw some code out</strong>. Embark on the journey and have faith that you are doing the right thing and that your system will be in better shape than it was before.</p>
<p>In the <a href="http://codeahoy.com/2016/05/02/software-rot-entropy-and-the-broken-window-theory/">next post or so</a>, we’ll look at Software Entropy. Please like and follow us on <a href="https://www.facebook.com/codeahoy">Facebook</a> and <a href="http://twitter.com/codeahoy">Twitter</a> to stay up-to-date.</p>
What is HTTP/2?2016-04-23T00:00:00+00:00https://codeahoy.com/2016/04/23/what-is-http2<p>There are many reasons to feel excited about HTTP/2. It is the first major update of the HTTP protocol in <strong>16 years!</strong> It was long overdue as the web dramatically evolved over the years. HTTP/2 is aimed at making the web faster and overcome many shortcomings of HTTP/1.1. HTTP/2 brings advancements to speed, efficiency, standardization and security.</p>
<h2 id="http11-simple-protocol-complex-workarounds">HTTP/1.1: Simple Protocol, Complex Workarounds</h2>
<p>HTTP was <a href="https://www.w3.org/Protocols/HTTP/AsImplemented.html">envisioned</a> as a simple, application-level, <strong>request-response</strong> protocol. Clients connect to servers and make HTTP requests asking for resources. Servers send a response back and terminate the connection.</p>
<p>HTTP/1.1 was released in 1999 to keep up with the performance demands of 90’s <a href="https://gizmodo.com/23-ancient-web-sites-that-are-still-alive-5960831">brochure-esque websites</a>. Its major improvement over its predecessor, HTTP/1.0, was that it allowed a <em>connection to be reused</em> to send multiple requests and responses. This improves performance by eliminating the connection setup overhead for each request.</p>
<!--more-->
<p>In 2004, Web 2.0 ushered in the new era of rich user experience and collaboration. It allowed users to interact with websites, leave comments, indulge in social sharing and enjoy many new features. <a href="https://en.wikipedia.org/wiki/Web_2.0#Technologies">New technologies</a> emerged and many Web 2.0 websites like Wikipedia, Flickr and YouTube went on to become huge successes.</p>
<p>While all this was going on, the protocol that ran it all, HTTP/1.1, was having trouble keeping up. The websites were growing significantly in size and complexity and it wasn’t designed to handle that. So smart people did what they do best: they invented workarounds, so-called best practices, to overcome HTTP/1.1 limitations. Hacks like <a href="http://stackoverflow.com/questions/14810890/what-are-the-disadvantages-of-using-http-pipelining">request pipelining</a>, <a href="https://www.maxcdn.com/one/visual-glossary/domain-sharding-2/">domain sharding</a>, <a href="https://developer.apple.com/library/iad/documentation/NetworkingInternet/Conceptual/SafariImageDeliveryBestPractices/ReducingHTTPRequestswithSprites/ReducingHTTPRequestswithSprites.html">sprite sheets</a> and <a href="http://www.websiteoptimization.com/speed/tweak/inline-images/">data inlining</a> were used to <strong>optimize performance</strong>. This added complexity on top of HTTP/1.1 and introduced regressions like unnecessary downloads, poor caching etc. Lack of standards meant that web developers had to deal with different browsers and versions. Ugh! I remember countless hours I sunk into tweaking websites to look great on both Firefox and Internet Explorer 7. It worked, but it was messy. And painful.</p>
<h2 id="the-journey-to-http2-a-spdy-stepping-stone">The Journey to HTTP/2: A SPDY Stepping Stone</h2>
<p>With no new development of HTTP/1.1 on the radar, Google took matters into their own very capable hands and started <a href="https://en.wikipedia.org/wiki/SPDY"><strong>SPDY</strong> (pronounced Speedy)</a>:</p>
<blockquote>
<p>SPDY is a replacement for HTTP, designed to speed up transfers of web pages, by eliminating much of the overhead associated with HTTP. SPDY supports <strong>out-of-order responses, header compression, server-side push, and other optimizations</strong> that give it an edge over HTTP when it comes to speed.</p>
</blockquote>
<p>In reality, SPDY <strong>didn’t really replace HTTP/1.1</strong> (and neither does HTTP/2 as we’ll later see). Augmented would be the right word. SPDY sat on top of HTTP/1.1 and heavily modified the data transfer formats and connection handling.</p>
<p>Google released SPDY in 2010 in Chrome 6 and soon deployed SPDY across all Google services. Word spread and SPDY soon gained traction and support from the community and vendors like Mozilla, Nginx, Microsoft and Facebook. The Internet Engineering Task Force (IETF), responsible for HTTP standards, seized the opportunity and <a href="https://tools.ietf.org/html/rfc7540">published the HTTP/2 standard in 2015</a>, deriving heavily from SPDY; SPDY’s fingerprints are all over HTTP/2.</p>
<p>Google has announced that it will <a href="http://blog.chromium.org/2016/02/transitioning-from-spdy-to-http2.html">stop supporting SPDY on May 15th, 2016</a>. Adios SPDY, you will be remembered as an important stepping stone on the journey to HTTP/2.</p>
<h2 id="http2-features">HTTP/2 Features</h2>
<p>HTTP/2 is here. Every modern web browser now supports HTTP/2. All major cloud and CDN vendors support it. The adoption among websites isn’t significant but it’s growing. As of April 24, 2016, <a href="http://w3techs.com/technologies/details/ce-http2/all/all">7.2% of all websites use HTTP/2</a>.</p>
<p>HTTP/2 brings many improvements to HTTP/1.1 and the biggest ones you need to know are:</p>
<h3 id="1-multiplexing">1. Multiplexing</h3>
<p>Under HTTP/1.1, connections were persistent, which allowed multiple requests to be sent or <a href="https://en.wikipedia.org/wiki/HTTP_pipelining">pipelined</a> over the same TCP connection. While it improved performance by reducing connection establishment overhead, it wasn’t a silver bullet. Even though multiple requests could be sent over the same connection, the responses had to arrive synchronously, in the same order as they were requested. This means an expensive resource (e.g. a large image file) will block lighter responses behind it if requested in the wrong order. This phenomenon is known as <a href="https://en.wikipedia.org/wiki/Head-of-line_blocking">head-of-line (HOL) blocking</a>. In fact, pipelining was so poorly supported by web servers that many web browsers simply disabled it.</p>
<p>HTTP/2 introduces multiplexing, which solves the HOL issue by allowing <strong>responses to arrive out of order</strong>, eliminating the need to open multiple connections.</p>
<h3 id="2-compression--metadata-to-reduce-header-overhead">2. Compression & Metadata to Reduce Header Overhead</h3>
<p>Each and every request and response under HTTP/1.1 carries headers, typically between 200 bytes and 2 KB in size. The first issue is that <strong>headers are not permitted to be compressed</strong>. Another issue with HTTP/1.1 headers is that they contain a lot of <strong>redundant information</strong> that is exchanged hundreds of times as a browser makes the many requests needed to load a webpage. Static headers like Accept* and User-Agent only need to be exchanged once.</p>
<p>HTTP/2 fixes both these problems by compressing and eliminating unnecessary headers. <em>High five.</em></p>
<h3 id="3-server-push">3. Server Push</h3>
<p>An HTTP/2 server can send data to a client before the client even asks for it. To understand why this is beneficial, let’s look at how a webpage is loaded under HTTP/1.1: the web browser requests a web page, waits for it to download, parses it to find all linked assets such as CSS and JavaScript, and then makes <em>separate requests</em> to download these assets.</p>
<p>HTTP/2 Server Push allows the server to proactively send these files to the browser along with the first response, knowing that the browser will need them to display the page. And if the browser already has a file in its cache, it can cancel the pushed stream or disable push altogether through its connection settings.</p>
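<p>In practice, many HTTP/2 servers let applications request a push through a preload hint: when a response includes a <code class="language-plaintext highlighter-rouge">Link</code> header like the one below, a push-capable server can send the referenced asset alongside the page. This is a sketch - the asset path is a placeholder and support varies by server:</p>

```http
Link: </css/styles.css>; rel=preload; as=style
```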
<h3 id="4-encryption-is-not-really-optional">4. Encryption is <em>Not Really</em> Optional</h3>
<p>People debated for a long time whether or not encryption (HTTPS) should be made <strong>mandatory</strong> in HTTP/2. In the end, the standards folks <a href="http://http2.github.io/faq/#does-http2-require-encryption">decided not to make it mandatory</a>, with a fair bit of warning:</p>
<blockquote>
<p>After extensive discussion, the Working Group did not have consensus to require the use of encryption (e.g., TLS) for the new protocol. However, some implementations have stated that they will only support HTTP/2 when it is used over an encrypted connection, and currently no browser supports HTTP/2 unencrypted.</p>
</blockquote>
<p>So while the use of TLS is not imposed by the standard, browser vendors have made it a requirement. <strong>Google Chrome and Mozilla Firefox have pledged to not support HTTP/2 without HTTPS</strong>.</p>
<h3 id="5-binary-encoding">5. Binary Encoding</h3>
<p>Unlike HTTP/1.1, HTTP/2 uses binary encoding. Without getting into the binary vs. text protocols debate, this means that HTTP/2 will be more efficient to parse and more compact on the wire, but will no longer be human readable. When I was learning HTTP, one of the first things I did was make a request by hand and look at the response, along with all the headers, as it arrived in all its glory. Unfortunately, this won’t be possible in HTTP/2 - at least not without specialized tools.</p>
<h3 id="5-backwards-compatibility">6. Backwards Compatibility</h3>
<p><strong>HTTP/2 is backwards compatible</strong> with HTTP/1.1. This means if you want to upgrade to HTTP/2, you could do so without changing anything. The upgrade will be equally seamless to your users. Don’t forget to undo HTTP/1.1 performance optimizations (aka best practices) as they no longer provide the same benefits.</p>
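<p>To see how small the server-side switch can be: on nginx (1.9.5 or newer), enabling HTTP/2 for an existing TLS server block is a one-word change. The snippet below is a sketch - the domain and certificate paths are placeholders:</p>

```nginx
server {
    # Adding "http2" to the listen directive enables HTTP/2
    # for this TLS listener; everything else stays the same.
    listen 443 ssl http2;
    server_name example.com;

    ssl_certificate     /etc/ssl/certs/example.com.crt;
    ssl_certificate_key /etc/ssl/private/example.com.key;
}
```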
<h2 id="http2---future-is-fast-and-complex">HTTP/2 - Future is Fast… And Complex</h2>
<p>People have <a href="https://queue.acm.org/detail.cfm?id=2716278">criticized</a> HTTP/2 for being Google’s idea, optimized for their needs and having needless complexity (things like flow controls to prevent DOS attacks). One has to agree that HTTP/2 is undoubtedly more complex than its predecessors, but the complexity is a necessary evil to keep up with today’s needs. An average webpage in 2016 is the same <a href="https://mobiforge.com/research-analysis/the-web-is-doom?r=1">size as the original DOOM shareware binary</a>. Sure Google benefits from HTTP/2, but don’t we all? It is time to move beyond HTTP/1.1 and I believe HTTP/2 is the answer.</p>
When to Rewrite from Scratch - Autopsy of a Failed Software2016-04-21T00:00:00+00:00https://codeahoy.com/2016/04/21/when-to-rewrite-from-scratch---autopsy-of-a-failed-software<p>It was winter of 2012. I was working as a software developer in a small team at a start-up. We had just released the first version of our software to a real corporate customer. The development finished right on schedule. When we launched, I was over the moon and very proud. It was extremely satisfying to watch the system process a couple of million unique users a day and send out tens of millions of SMS messages. By summer, the company had real revenue. I got promoted to software manager. We hired new guys. The company was poised for growth. Life was great. <strong>And then we made a huge blunder and decided to rewrite the software. From scratch.</strong></p>
<p><!--more--></p>
<h2 id="why-we-felt-that-rewrite-from-scratch-was-needed">Why We Felt That a Rewrite from Scratch Was Needed</h2>
<p>We had written the original system with a gun to our heads. We had to race to the finish line and incurred technical debt. We weren’t having long design discussions or review meetings - we didn’t have time for such things. We would finish a feature as quickly as we can, get it tested and release it to the customer. We had a shared office (from TRTech) and I remember new software developers at other companies getting into lengthy design and recurring architecture debates over design patterns, something we couldn’t afford to do.</p>
<p>Despite the agile-on-steroids process, the original system wasn’t badly written and was generally well structured. There was some spaghetti code, carried over from the company’s previous proof-of-concept attempts, that we left untouched because it was working and we had no time. But instead of thinking about incremental improvements, we <em>convinced</em> ourselves that we needed to rewrite from scratch because:</p>
<ul>
<li>the old code was bad and hard to maintain.</li>
<li>the “monolith java architecture” was inadequate for our future need of supporting a very large operator with 60 million mobile users and multi-site deployments.</li>
<li>I <em>wanted</em> to try out new, shiny technologies like Apache Cassandra, Virtualization, Binary Protocols, Service Oriented Architecture, etc.</li>
</ul>
<p>We convinced the entire organization and the board and sadly, we got our wish.</p>
<h2 id="the-rewrite-journey">The Rewrite Journey</h2>
<p>The development officially began in spring of 2012 and we set the end of January, 2013 as the release date. Because the vision was so grand, we needed even more people. I hired consultants and a couple of remote developers in India. However, we didn’t fully anticipate the need to maintain the original system in parallel with new development, and we underestimated customer demands. Remember I said in the beginning we had a real customer? The customer was one of the biggest mobile operators in South America, and once our system had adoption from its users, they started making demands for changes and new features. So we had to continue updating the original system, albeit half-heartedly, because we were digging its grave. We dodged new feature requests from the customer as much as we could because we were going to throw the old system away anyway. This contributed to delays and we missed our January deadline. In fact, we missed it by 8 whole months!</p>
<p>But let’s skip to the end. When the project was finally finished, it looked great and met all the requirements. Load tests showed that it could easily support over 100 million users. The configuration was centralized and it had a beautiful UI tool to look at charts and graphs. It was time to go and kill the old system and replace it with the new one… <strong>until the customer said “no” to the upgrade</strong>. It turned out that the original system had gained wide adoption and their users had started relying on it. They wanted absolutely no risks. Long story short, after months of back and forth, we got nowhere. The project was officially doomed.</p>
<h2 id="lessons-learnt">Lessons Learnt</h2>
<ul>
<li>You should almost never, ever rewrite from scratch. We rewrote for all the wrong reasons. While parts of the code were bad, we could have easily fixed them with refactoring if we had taken the time to read and understand the source code written by other people. We had genuine concerns about the scalability and performance of the architecture to support more sophisticated business logic, but we could have introduced these changes incrementally.</li>
<li>Systems rewritten from scratch offer no new value to the user. To the engineering team, new technology and buzzwords may sound cool but they are <strong>meaningless to customers</strong> if they don’t offer new features that the customers need.</li>
<li>We <strong>missed real opportunities</strong> while we were focused on the rewrite. We had a very basic ‘Web Tool’ that the customer used to look at charts and reports. As they became more involved, they started asking for additional features such as real-time charts, access levels, etc. Because we weren’t interested in the old code and had no time anyway, we either rejected new requests or did a bad job. As a result, the customer stopped using the tool and insisted on reports by email. Another lost opportunity was the chance to build a robust analytics platform that was <em>badly</em> needed.</li>
<li>I underestimated the effort of maintaining the old system while the new one is in development. We estimated 3-5 requests a month and got 3 times as many.</li>
<li>We thought our code was harder to read and maintain since we didn’t use the proper design patterns and practices that other developers spent days discussing. It turned out that most professional code I have seen in larger organizations is twice as bad as what we had. So we were dead wrong about that.</li>
</ul>
<h2 id="when-is-rewrite-the-answer">When Is Rewrite the Answer?</h2>
<p><a href="http://www.joelonsoftware.com/articles/fog0000000069.html">Joel Spolsky made strong arguments against rewrites</a> and suggests that one should never do it. I’m not so sure. Sometimes incremental improvements and refactoring are very difficult and the only way to <strong>understand</strong> the code is to rewrite it. Plus, software developers love to write code and create new things - it’s boring to read someone else’s code and try to understand their ‘mental abstractions’. But good programmers are also good maintainers.</p>
<p>If you want to rewrite, do it for the right reasons and plan properly for the following:</p>
<ul>
<li>The old code will still need to be maintained, in some cases, long after you release the new version. Maintaining two versions of code will require huge efforts and you need to ask yourself if you have enough time and resources to justify that based on the size of the project.</li>
<li>Think about losing other opportunities and prioritize.</li>
<li>Rewriting a big system is more risky than smaller ones. Ask yourself if you can incrementally rewrite. We switched to a new database, became a ‘Service Oriented Architecture’ and changed our protocols to binary, all at the same time. We could have introduced each of these changes incrementally.</li>
<li>Consider the developers’ bias. When developers want to learn a new technology or language, they want to write some code in it. While I’m not against it and it’s a sign of a good environment and culture, you should take this into consideration and weigh it against risks and opportunities.</li>
</ul>
<p>Michael Meadows made <a href="http://programmers.stackexchange.com/questions/6268/when-is-a-big-rewrite-the-answer">excellent observations</a> on when “BIG” rewrite becomes necessary:</p>
<blockquote>
<p><strong>Technical</strong></p>
<ul>
<li>The coupling of components is so high that changes to a single component cannot be isolated from other components. A redesign of a single component results in a cascade of changes not only to adjacent components, but indirectly to all components.</li>
<li>The technology stack is so complicated that future state design necessitates multiple infrastructure changes. This would be necessary in a complete rewrite as well, but if it’s required in an incremental redesign, then you lose that advantage.</li>
<li>Redesigning a component results in a complete rewrite of that component anyway, because the existing design is so fubar that there’s nothing worth saving. Again, you lose the advantage if this is the case.</li>
</ul>
<p><strong>Political</strong></p>
<ul>
<li>The sponsors cannot be made to understand that an incremental redesign requires a long-term commitment to the project. Inevitably, most organizations lose the appetite for the continuing budget drain that an incremental redesign creates. This loss of appetite is inevitable for a rewrite as well, but the sponsors will be more inclined to continue, because they don’t want to be split between a partially complete new system and a partially obsolete old system.</li>
<li>The users of the system are too attached with their “current screens.” If this is the case, you won’t have the license to improve a vital part of the system (the front-end). A redesign lets you circumvent this problem, since they’re starting with something new. They’ll still insist on getting “the same screens,” but you have a little more ammunition to push back.
Keep in mind that the total cost of redesigning incrementally is always higher than doing a complete rewrite, but the impact to the organization is usually smaller. In my opinion, if you can justify a rewrite, and you have superstar developers, then do it.</li>
</ul>
</blockquote>
<p>Abandoning working projects is dangerous and we wasted an enormous amount of money and time duplicating working functionality we already had, rejected new features, irritated the customer and delayed ourselves by years. If you are embarking on a rewrite journey, all the power to you, but make sure you do it for the right reasons, understand the risks and plan for it.</p>
Git Stash - Saving Your Changes2016-04-18T00:00:00+00:00https://codeahoy.com/2016/04/18/10-git-stash-explained<p>Let’s say you are in the middle of implementing a new feature. You’re halfway through your changes and the code is in a messy state. You get a message that there’s an urgent issue that requires you to switch gears and work on it immediately. You don’t want to commit your half-baked changes, but you also don’t want to lose your work because you want to revisit it later. What do you do?</p>
<!--more-->
<p>The answer to this problem is the <code class="language-plaintext highlighter-rouge">git stash</code> command.</p>
<p>Running <code class="language-plaintext highlighter-rouge">git stash</code> will take the changes you’ve made to tracked files in the working directory, as well as staged changes, and save them to a stack. You can reapply these changes from the stack at any time. After stashing, you’ll end up with a clean working directory and can freely switch branches and work on something else. Let’s walk through a complete example.</p>
<p>Let’s say we’re in the middle of editing <code class="language-plaintext highlighter-rouge">file.txt</code> when we get a Slack message to switch to something else right away. We’ll use <code class="language-plaintext highlighter-rouge">git stash</code> to save our changes.</p>
<pre class="prettyprint lang-sh">
$ echo "Improvement 1 of 3" >> file.txt
$ git status
On branch master
Your branch is ahead of 'origin/master' by 1 commit.
modified: file.txt
$ git stash save "Partial improvement to file.txt"
$ git status
On branch master
nothing to commit, working directory clean
</pre>
<p>After stashing, you’re free to do whatever you like. You can switch to a different branch, make your changes, commit and push them to the remote repo. After you’re done, you’re ready to resume working on <code class="language-plaintext highlighter-rouge">file.txt</code> which you stashed away. To reapply, you will use the <code class="language-plaintext highlighter-rouge">git stash apply</code> command.</p>
<pre class="prettyprint lang-sh">
$ git stash apply
</pre>
<p>Assuming there are no conflicts, you should get your partial changes back. If git can’t reapply your changes safely, it will throw a merge conflict error. We’ll look at how to resolve merge conflicts below.</p>
<p>You can stash multiple times. Each time you run <code class="language-plaintext highlighter-rouge">git stash</code> it will save a new stash on the stack. To see a list of all the stashes stored on the stack, use the <code class="language-plaintext highlighter-rouge">git stash list</code> command.</p>
<pre class="prettyprint lang-sh">
$ git stash list
stash@{0}: WIP on master: d724198 partial improvement 2
stash@{1}: WIP on master: d724198 bug fix for Unity
stash@{2}: WIP on master: c9a03f4 added partial improvement 1
</pre>
<p>The stashes are ordered from newest to oldest (hence it is a stack.) <code class="language-plaintext highlighter-rouge">stash@{0}</code> is the most recent and <code class="language-plaintext highlighter-rouge">stash@{2}</code> is the oldest in the example above. To restore the very first stash you saved:</p>
<pre class="prettyprint lang-sh">
$ git stash apply stash@{2}
</pre>
<p>A stash can be applied to any branch, not just the branch it was saved from. Also note that stash ignores ‘untracked’ files. If you added a new file, you must first add it to the index using <code class="language-plaintext highlighter-rouge">git add</code> before stashing.</p>
<h3 id="merge-conflicts">Merge conflicts</h3>
<p>There are times when <code class="language-plaintext highlighter-rouge">git stash apply</code> won’t work and will throw a merge conflict. For example:</p>
<pre class="prettyprint lang-sh">
$ git stash apply
error: The following untracked working tree files would be overwritten by merge:
README.md
Please move or remove them before you merge.
Aborting
</pre>
<p>The easiest way to get out of merge conflicts is to apply your stash to a new branch. To do this you can use <code class="language-plaintext highlighter-rouge">git stash branch &lt;new branchname&gt;</code>. This will create a new branch, check out the commit you were on when you stashed your changes, and reapply your stash on top of it.</p>
<pre class="prettyprint lang-sh">
$ git stash branch temp_restore
</pre>
<p>That’s pretty much it. Here are some bonus tips for git stash:</p>
<ul>
<li>To save your stash with a message or give it a name you can use the following syntax: <code class="language-plaintext highlighter-rouge">git stash save &lt;message&gt;</code>. For example: <code class="language-plaintext highlighter-rouge">git stash save "feature orca-654"</code>.</li>
<li>By default, stash doesn’t save untracked files. You could either stage them or save with the <code class="language-plaintext highlighter-rouge">-u</code> switch e.g. <code class="language-plaintext highlighter-rouge">git stash save -u</code></li>
<li>To delete the stash after it has been applied, you can use the <code class="language-plaintext highlighter-rouge">git stash pop</code> command e.g. <code class="language-plaintext highlighter-rouge">git stash pop stash@{2}</code> will apply <code class="language-plaintext highlighter-rouge">stash@{2}</code> and delete it from the stack.</li>
<li>To delete a stash without applying it, use <code class="language-plaintext highlighter-rouge">git stash drop</code> e.g. <code class="language-plaintext highlighter-rouge">git stash drop stash@{0}</code></li>
<li>Use <code class="language-plaintext highlighter-rouge">git stash show</code> to see a summary of diffs.</li>
<li>To delete all your stashes, use the <code class="language-plaintext highlighter-rouge">git stash clear</code> command. Be careful, because this is a dangerous command: you will lose your stashes forever.</li>
</ul>
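<p>If you want to try the whole stash workflow without touching a real project, the sketch below builds a throwaway repository and runs through save, list and pop. The file name, commit messages and identity settings are placeholders:</p>

```shell
# Build a throwaway repo so the stash commands can be tried safely.
set -e
dir=$(mktemp -d)
cd "$dir"
git init -q
git config user.email "you@example.com"   # placeholder identity
git config user.name "Example"
echo "line 1" > file.txt
git add file.txt
git commit -qm "initial commit"

echo "Improvement 1 of 3" >> file.txt     # a half-finished change
git stash save "Partial improvement to file.txt"
git stash list        # shows the stash we just saved
git stash pop         # reapply the change and drop it from the stack
tail -n 1 file.txt    # the half-finished line is back
```

<p>Because <code class="language-plaintext highlighter-rouge">pop</code> both applies and drops, <code class="language-plaintext highlighter-rouge">git stash list</code> prints nothing afterwards.</p>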
<p>That’s all. If you enjoyed this article, please share it with your friends. The links to share on Facebook, Twitter, LinkedIn are below.</p>
What's the difference between git fetch vs git pull?2016-04-18T00:00:00+00:00https://codeahoy.com/2016/04/18/10-git-pull-vs-git-fetch-(and-stashing)<p>Git has two types of repositories: local and remote. The local repository is on your computer and has all the files, commit history etc. Remote repositories are usually hosted on a central server or on the Internet.</p>
<p>Downloading data from the remote repo to local is an essential part of working with git.</p>
<p>Both <strong>git pull</strong> and <strong>git fetch</strong> are used to download data from remote repository. These two commands have important differences and similarities. Let’s explore them in more detail.</p>
<!--more-->
<h2 id="the-main-difference-between-git-fetch-and-git-pull">The main difference between <code class="language-plaintext highlighter-rouge">git fetch</code> and <code class="language-plaintext highlighter-rouge">git pull</code></h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git fetch origin
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">git fetch</code> <strong>only</strong> downloads the latest data from the remote repository. It does <strong>not merge</strong> any of this new data into the current branch or change the working files. Fetch is <em>safe</em> because it only downloads new changes (since you last synced with the remote.) It doesn’t create conflicts or interfere with work in progress. Developers use <code class="language-plaintext highlighter-rouge">fetch</code> to find out if there have been new changes, review changes before merging or sometimes to track someone else’s feature branch.</p>
<hr />
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git pull origin master
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">git pull</code> in contrast <em>not only</em> downloads the latest data, but it also <strong>automatically merges</strong> it into your current branch and updates the working files automatically. It doesn’t give you a chance to review the changes before merging, and as a consequence, ‘merge conflicts’ can and do occur. One important thing to keep in mind is that it will merge <em>only</em> into the current working branch. Other branches will stay unaffected.</p>
<p>Here’s a diagram to illustrate the difference between git fetch and git pull.</p>
<p><img src="https://codeahoy.com/img/git-pull-vs-fetch.png" alt="git pull and fetch" /></p>
<p>I have covered the main difference between <code class="language-plaintext highlighter-rouge">git fetch</code> and <code class="language-plaintext highlighter-rouge">git pull</code> above. But if you want <strong>more details</strong>, read on.</p>
<h2 id="git-fetch-explained-in-detail"><code class="language-plaintext highlighter-rouge">git fetch</code> explained in detail</h2>
<p>As we’ve seen, <code class="language-plaintext highlighter-rouge">git fetch</code> <strong>only downloads</strong> latest changes into the local repository, and <strong>does not merge</strong> into the current branch. It downloads fresh changes that other developers have pushed to the remote repo since the last fetch and allows you to review and merge manually at a later time using <code class="language-plaintext highlighter-rouge">git merge</code>. Because it doesn’t change your working directory or the staging area, it is entirely safe, and you can run it as often as you want.</p>
<p>You may be wondering where the changes are stored after a fetch, since they are not merged into the working files. The answer is that they are stored in your local repository in what are called <a href="https://git-scm.com/book/en/v2/Git-Branching-Remote-Branches">remote tracking branches</a>. A remote tracking branch is a local copy (or mirror) of a remote branch, e.g. <code class="language-plaintext highlighter-rouge">origin/master</code>. You can run <code class="language-plaintext highlighter-rouge">git branch -a</code> to see all local and remote branches. After doing a fetch, if you want to check out what has changed, you can do:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">git log origin/master ^master</code>: to get a list of all commits that are in remote master but not in your local branch.</li>
<li><code class="language-plaintext highlighter-rouge">git diff ..origin</code>: to see the diff.</li>
<li><code class="language-plaintext highlighter-rouge">git checkout origin/master</code>: to checkout the remote master branch and see what files have changed.</li>
</ul>
<p>Once you have reviewed the changes and are ready to merge, you can switch back to the master branch and run <code class="language-plaintext highlighter-rouge">git merge</code>. It will merge changes from the remote branch into local.</p>
<p>Here’s an example. Let’s switch to develop branch and do <code class="language-plaintext highlighter-rouge">git fetch</code>. To keep it simple, I’ll omit the output.</p>
<pre class="prettyprint lang-sh">
$ git checkout develop
$ git fetch
</pre>
<p>Now let’s see the list of commits that are in remote develop but not in local.</p>
<pre class="prettyprint lang-sh">
$ git log origin/develop ^develop
commit 6123ef537f0dac5410f409a8dfc2719491e13fc9 (origin/master, origin/HEAD)
Author: Umer Mansoor <...@gmail.com>
Date: Sat Feb 1 08:11:52 2020 -0800
fixed toc
commit 7143fccddce97405b05f51facf9e1560301027ab
Author: Umer Mansoor <...@gmail.com>
Date: Sat Feb 1 08:10:01 2020 -0800
Update README.md
</pre>
<p>When you are ready to merge, simply run:</p>
<pre class="prettyprint lang-sh">
$ git merge
Updating 89aaded..5926bf5
Fast-forward
README.md | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
</pre>
<h2 id="git-pull-example"><code class="language-plaintext highlighter-rouge">git pull</code> example</h2>
<p>The <code class="language-plaintext highlighter-rouge">git pull</code> command downloads from the remote repository to the local repository and <strong>automatically merges</strong> those changes into the <em>current branch</em>.</p>
<pre class="prettyprint lang-sh">
$ git checkout master
...
$ git pull
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), done.
From https://github.com/repo/arepo
7548eeb..8599bfe master -> origin/master
Updating 7548eeb..8599bfe
Fast-forward
README.md | 3 +++
1 file changed, 3 insertions(+)
create mode 100644 README.md
</pre>
<p>You can see that in the output above, the changes were downloaded from the remote master and then merged using fast-forward method into the local master branch.</p>
<h4 id="summary">Summary</h4>
<p>In summary, <code class="language-plaintext highlighter-rouge">pull</code> and <code class="language-plaintext highlighter-rouge">fetch</code> are similar in the sense that they both download latest changes from the remote repo to local. The <strong>difference</strong> is that <code class="language-plaintext highlighter-rouge">pull</code> automatically merges changes into the current branch while <code class="language-plaintext highlighter-rouge">fetch</code> doesn’t. If you are interested, here are some more commands and details that I didn’t cover in the post to keep it simple.</p>
<ul>
<li>Under the hood, <code class="language-plaintext highlighter-rouge">git pull</code> is equivalent to running <code class="language-plaintext highlighter-rouge">git fetch</code> followed by <code class="language-plaintext highlighter-rouge">git merge FETCH_HEAD</code>.</li>
<li>To run <code class="language-plaintext highlighter-rouge">git pull</code> in verbose mode, add the verbose switch i.e. <code class="language-plaintext highlighter-rouge">git pull --verbose</code></li>
<li>To put all your changes on top of what everyone else has committed, you can pull using the rebase flag i.e. <code class="language-plaintext highlighter-rouge">git pull --rebase origin</code>.</li>
</ul>
<p>If you’re interested, check out my next tutorial on <a href="https://codeahoy.com/2016/04/18/10-git-stash-explained/">git stash</a>. Stashing is a useful tool in git that allows users to save their partially complete changes and reapply them at a later time.</p>
The Problem With Code Coverage Metrics2016-04-16T00:00:00+00:00https://codeahoy.com/2016/04/16/do-not-misuse-code-coverage<p>Code coverage is a valuable metric. Software developers write tests for the code they have written and run code coverage analysis, which gives them an assessment of how much of their code is covered by tests and, more importantly, which parts of the code are untested.</p>
<p>Some organizations and managers make a high level of code coverage mandatory for their teams. “<strong>90% code coverage and no less!</strong>”, says the manager. It becomes another metric in management’s arsenal to <em>assess</em> code quality and (god forbid) the team’s performance. It’s a <strong>big mistake</strong> to interpret and use code coverage in this way. Code coverage doesn’t say anything about the quality of the code or the tests. It is very easy to get high code coverage with <em>low quality</em> testing. The code coverage number will not tell you that some parameter was not checked for a <code class="language-plaintext highlighter-rouge">null</code> value, that the contract required <code class="language-plaintext highlighter-rouge">String</code>s to be <code class="language-plaintext highlighter-rouge">trim()</code>ed before use, or that even though all lines of code were hit, some particular sequence wasn’t tested. Nope. The tests might be meaningless and brittle, masking real issues, but who cares, as long as there is coverage.</p>
<!--more-->
<p>Let me get this straight again: <em>code coverage is a valuable metric</em>. But when management turns it into a goal and becomes fixated on it to measure quality and performance, things rapidly disintegrate. It might have to do with our <a href="http://www.amazon.com/exec/obidos/ASIN/0932633366/ref=nosim/joelonsoftware">psychology</a> or our nature, but we humans optimize our performance according to how we are being measured and become distracted from what really matters. Testing requires thoughtfulness and careful design. Scott Bain, author at Sustainable Test Driven Development, <a href="http://www.sustainabletdd.com/2011/12/lies-damned-lies-and-code-coverage.html">explained</a> it better:</p>
<blockquote>
<p>If developers are writing unit tests because “the boss says so” then they have no real professional or personal motivation driving the activity. They’re doing it because they have to, not because they want to. Thus, they will put in whatever effort they have to in order to increase their code coverage to the required level and not one bit more. <strong>It becomes a “tedious thing I have to do to before I can check in my code, period.”</strong></p>
</blockquote>
<p>In his excellent article “<a href="http://www.exampler.com/testing-com/writings/coverage.pdf">How to Misuse Code Coverage</a>”, Brian Marick discovered the same thing. Organizations that mandated a code coverage percentage got exactly the percentage they asked for, which is scary because it might be a sign that people are gaming the system to meet the target.</p>
<blockquote>
<p>Perhaps this might hint at the answer: when I talk about coverage to organizations that use
85%, say, as a shipping gate, I sometimes ask how many people have gotten substantially
higher, perhaps 90%. There’s usually a few who have, <strong>but everyone else is clustered right
around 85%. Are we to believe that those other people just happened to hit 85% and were
unable to find any other tests worth writing? Or did they write a first set of tests, take their
coverage results, bang away at the program until they got just over 85%</strong>, and then heave a
sigh of relief at having finished a not-very-fun job?</p>
</blockquote>
<p>Even if we take the positive outlook that people will make an honest effort to meet enforced code coverage percentage, it’s still counter-productive to have them obsess over the code coverage as a number rather than focusing on the quality, sufficiency and maintainability of tests.</p>
<h2 id="but-my-team-has-had-several-bugs-escape-to-production">“But my team has had several bugs escape to production!”</h2>
<p>Mandating a code coverage number is not the answer. If anything, you’ll make matters worse. Writing good tests is a skill. As a manager, it is your job to figure out exactly what is going wrong.</p>
<p>I managed a team that consistently had issues with bugs. Luckily, we were catching these bugs in Q/A, but because the development team was offshore and in a different timezone, the feedback loop was becoming a burden. The local software development manager was out of ideas as well. I started my investigation one evening. The first thing I looked at was their code coverage. Not stellar at 76%, but not so bad either. I scratched my head, dug into the source code and started looking at the tests. And I didn’t have to dig deep to find the problem: the tests were <em>crap</em>. I remember a small test that alone got 40% coverage! It was a <a href="http://blog.stevensanderson.com/2009/08/24/writing-great-unit-tests-best-and-worst-practises/">dirty hybrid of unit and integration test</a> that tested a high-level event processing method with the right parameters. It didn’t even check what would happen when one of the parameters was missing or had the wrong value.</p>
<p>In their defense, they had assigned a “junior developer” to write the tests while the “senior guys” wrote the actual code. How did we fix it? We didn’t ask the team to increase code coverage; in fact, we didn’t even mention it. We started educating them on how to write good unit tests. We encouraged them to watch <a href="https://www.youtube.com/watch?v=wEhu57pih5w">videos</a> during office hours. After initial resistance and passive-aggressiveness, they saw the light and realized that a lot of the errors Q/A was discovering, they could find themselves using proper unit testing techniques. More importantly, they realized that to grow professionally, they must learn to write proper tests and write them themselves. The bug count went down significantly.</p>
<h2 id="the-google-approach">The Google approach</h2>
<p>I like the Google <a href="https://docs.google.com/presentation/d/1god5fDDd1aP6PwhPodOnAZSPpD80lqYDrHhuhyD7Tvg/edit#slide=id.g3f5c82004_99_130">approach</a>. They <strong>strive</strong> for 85% code coverage but it is not <strong>“set in stone”</strong>. Their code coverage results over a month are shown in the graph below.</p>
<p><img src="https://codeahoy.com/img/coveragegoogle.png" alt="Google Code Coverage" /></p>
<p>Pretty impressive.</p>
<h2 id="summary">Summary</h2>
<p>Managers should expect their teams to have high coverage, but they must not turn it into a target. Metrics do not make good code. The ultimate goal is to have fewer bugs escape into production.</p>
Is it OK to make mistakes at work?2016-04-14T00:00:00+00:00https://codeahoy.com/2016/04/14/mistakes-at-work-are-not-sins<blockquote>
<p>I have not failed. I have just found 10,000 ways that won’t work. -Thomas Edison</p>
</blockquote>
<p>Software development is an activity that requires serious brain power. And it’s only natural that software developers make mistakes along the way. However, many organizations and managers stigmatize failures. It’s considered a <em>sin</em> to make an error on the job. A sin that is not going to be forgotten any time soon.</p>
<p>We are taught from an early age that mistakes are bad. Students are marked by the number of mistakes they make, and society looks down on failures. So why should managers allow it? The answer is simple: <strong>people will make mistakes in any activity that requires imagination and creativity</strong>. That’s how they learn and improve. <a href="http://www.amazon.com/Peopleware-Productive-Projects-Second-Edition/dp/0932633439">DeMarco and Lister</a> explained it better:</p>
<blockquote>
<p>Fostering an atmosphere that doesn’t allow for error simply
makes people defensive. They don’t try things that may turn out
badly. You encourage this defensiveness when you try to systematize
the process, when you impose rigid methodologies so that staff
members are not allowed to make any of the key strategic decisions
lest they make them incorrectly. The average level of technology
may be modestly improved by any steps you take to inhibit error.
<strong>The team sociology, however, can suffer grievously</strong>.</p>
</blockquote>
<!--more-->
<p>The last sentence is the key: <em>The team sociology, however, <del>can</del> will suffer grievously</em>. As a software manager, your primary day job is to make it possible for your team to do work. When people become defensive, they lose motivation to do good work. Whenever a ‘bug’ is reported by a customer or client, instead of focusing their efforts to objectively locate the bug, employees spend time and energy on ‘covering their asses’ and collecting data to ‘prove’ that there has to be something “wrong with the other system.”</p>
<p>The biggest fear managers have is that mistakes will cost their company money or damage customer relationships. While this is true, most mistakes made in the workplace don’t damage a company’s reputation. Managers should identify the <strong>few</strong> key customer-facing areas where mistakes would be catastrophic and put additional oversight and checks in place. Keep this list very short and allow employees the freedom to make mistakes in other areas.</p>
<p>Another common irrational fear is that by allowing mistakes, people will make ‘stupid’ or ‘repeated’ mistakes. I guess I have been lucky to have managed very few ‘bad apples’. (I could only think of 2 bad employees in the last 4 years.) The majority of software developers are smart, talented and proud people who value quality work, learning and professional development.</p>
<p>In his autobiography, <a href="http://www.amazon.com/Against-Odds-Autobiography-Business-Icons/dp/1587990148">Against the Odds</a>, James Dyson writes:</p>
<blockquote>
<p>I made <strong>5127 prototypes</strong> of my vacuum before I got it right. There were <strong>5126 failures</strong>. But I learned from each one. That’s how I came up with a solution. So I don’t mind failure. I’ve always thought that schoolchildren should be marked by the number of failures they’ve had. The child who tries strange things and experiences lots of failures to get there is probably more creative…
We’re taught to do things the right way. <strong>But if you want to discover something that other people haven’t, you need to do things the wrong way</strong>.</p>
</blockquote>
<p>I couldn’t agree more. Mistakes are part of a healthy team culture. Unless the job is dead simple and requires no imagination and creativity, mistakes will be made. I made a mistake in 2012 that cost my company as much as $5,000. A few ‘billable’ events were lost due to a Redis library issue. I had just convinced my bosses to let me replace the old MySQL database with shiny new Redis. I was devastated. We found the bug after 3 sleepless nights. I was very angry at myself for screwing up. But my boss shrugged it off, thanked me sincerely for fixing the problem and recognized my efforts publicly. Quoting <a href="https://en.wikipedia.org/wiki/John_Wooden">someone</a>, he said to me: “<em>If you’re not making mistakes, then you’re not trying anything new.</em>”</p>
<p>Mistakes should be embraced and celebrated. Software development is a difficult activity and software developers will make mistakes. They shouldn’t be crucified for making <em>honest</em> mistakes. It’s part of the learning. Great managers don’t cheat their employees of personal growth and development opportunities.</p>
Generating Session Ids2016-04-13T00:00:00+00:00https://codeahoy.com/2016/04/13/generating-session-ids<p>Session Id’s are <strong>unique, short-lived numbers</strong> that servers assign to users when they log in (or visit) so they can remember (or track) users for the duration of their sessions. Servers use session Id’s to remember users because the underlying protocol, HTTP, is <a href="https://en.wikipedia.org/wiki/Stateless_protocol">stateless</a>. Once they receive a session Id from the server, users send it back in the following requests to identify themselves. For example, when you log in to a website, the server assigns you a session Id and sends it to your browser wrapped in a <a href="https://en.wikipedia.org/wiki/HTTP_cookie">cookie</a>. The browser <strong>automatically sends the cookie back</strong> in subsequent requests so the server knows who is making the request.</p>
<p>Almost <a href="https://docs.djangoproject.com/en/1.10/topics/http/sessions/">all</a> web <a href="https://github.com/expressjs/session">frameworks</a> I have worked with have built-in support for sessions: they generate and assign Id’s under the hood. The only time I had to generate session Id’s manually was when I was building a REST application (game service) that needed a custom way to identify users and sessions. This blog post is the result of the research I had to do to build that feature. I would highly recommend against rolling your own session handling code, unless you absolutely have to.</p>
<!--more-->
<h2 id="session-ids-are-unique-transient-and-non-guessable">Session Ids are unique, transient and non-guessable</h2>
<ol>
<li>
<p>Session Id’s must be unique across all users. Can you imagine two people getting assigned the same Social Security number? <a href="http://www.pcworld.com/article/3004654/government/a-tale-of-two-women-same-birthday-same-social-security-number-same-big-data-mess.html">That would be a disaster</a>.</p>
</li>
<li>
<p>Session Id’s have a ‘best-by’ date and they <em>timeout</em> after a certain period. If they didn’t, a hacker could steal and use them indefinitely. Generally, the expiry period ranges from minutes to weeks. High-risk applications expire session Id’s more frequently than low-risk ones to minimize the attack window.</p>
</li>
<li>
<p>Session Id’s are not guessable. A bad example would be an algorithm that generates sequential session Id’s. Hackers can easily identify patterns and hijack user sessions.</p>
</li>
</ol>
<p>You can generate and assign session Id’s to users in many different ways. I’ll discuss three common methods below.</p>
<h2 id="1-random-session-ids">1. Random session Ids</h2>
<p>Random session Id’s have no meaning by virtue of being completely random. The server sends them to the client and stores them in a database along with the user information.</p>
<table>
<thead>
<tr>
<th>Session Id</th>
<th>User Id</th>
<th>Expiry Time</th>
</tr>
</thead>
<tbody>
<tr>
<td>fb2e77d.47a0479900504cb3ab4a1f626d174d2d</td>
<td>jimHalpert1</td>
<td>15 minutes</td>
</tr>
</tbody>
</table>
<p>If session Id’s are random numbers, how do we ensure that they cannot be <strong>guessed</strong> or predicted by hackers? In cryptography theory, entropy is the measure of <strong>uncertainty</strong> associated with a random number. Session Id’s should have <em>very</em> high entropy to protect against attacks. <a href="https://www.owasp.org/index.php/Session_Management_Cheat_Sheet#Session_Expiration">OWASP</a> suggests at least 64 bits of entropy. Sounds complex? (It did to me.) The good news is that you and I should never have to worry about writing our own algorithms (don’t even think about it - random number generation is very complex). Most languages have <a href="https://en.wikipedia.org/wiki/Pseudorandom_number_generator">pseudo random number generators (PRNGs)</a> that generate ‘cryptographically secure’ random numbers with high entropy. As an example, Tomcat uses SHA1<strong>PRNG</strong> to generate a random number and hashes it with MD5 (see warning below) to create session Id’s. Here’s a link to the <a href="https://docs.jboss.org/jbossas/javadoc/4.0.2/org/jboss/web/tomcat/tc5/session/SessionIDGenerator.java.html">source code</a>. (There’s a list of PRNGs for other languages near the end of this post.)</p>
<p><strong>Warning</strong>: <strong>Do NOT use MD5</strong> to generate session Id’s because it is <a href="https://security.stackexchange.com/questions/19906/is-md5-considered-insecure">considered insecure</a>. In the application I was building, I used SHA-2 (SHA-256).</p>
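<p>To make this concrete, here’s a minimal Java sketch of the random approach (the class and method names are my own, not from any framework). It draws 16 bytes - 128 bits of entropy, well above the 64-bit minimum OWASP suggests - from <code class="language-plaintext highlighter-rouge">SecureRandom</code> and hex-encodes them:</p>

```java
import java.security.SecureRandom;

public class SessionIdGenerator {
    // SecureRandom is a CSPRNG; a single instance is thread-safe
    // and is seeded from the operating system's entropy source.
    private static final SecureRandom RANDOM = new SecureRandom();

    // Returns a 128-bit random id as a 32-character hex string.
    public static String newSessionId() {
        byte[] bytes = new byte[16];
        RANDOM.nextBytes(bytes);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }
}
```

<p>The server would store the generated id in its session table, along with the user id and expiry time, and send it to the client in a cookie.</p>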
<h2 id="2-session-ids-using-shared-secret">2. Session Id’s using shared secret</h2>
<p>These types of session Id’s are created in such a way that the information needed to identify a user is embedded into the session Id itself. Since session Id’s are self-contained, the server <strong>doesn’t need to store them</strong>. Let’s look at a trivial algorithm that generates session Id’s by combining username, IP address and a client secret:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sessionId</span> <span class="o">=</span> <span class="no">SHA2</span><span class="o">(</span><span class="n">username</span> <span class="o">+</span> <span class="n">ipAddress</span> <span class="o">+</span> <span class="n">secretKey</span> <span class="cm">/* or salt */</span><span class="o">)</span>
</code></pre></div></div>
<p>When a request arrives, it contains the username, and the IP address is automatically recorded. The server then uses the username, the IP address and the secret key to <em>re-generate</em> the session Id and checks whether it matches the session Id passed by the client. If it does, the verification is successful.</p>
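<p>Here’s a rough Java sketch of this scheme (the class and method names are mine; a real implementation would fold more parameters, such as a timestamp, into the hash). Verification simply re-derives the id and compares it to the one the client presented, using a constant-time comparison to avoid timing attacks:</p>

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SharedSecretSession {
    // Derives the session id: SHA2(username + ipAddress + secretKey),
    // hex-encoded. Only the server knows secretKey.
    static String sessionId(String username, String ipAddress, String secretKey) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] hash = sha256.digest(
                    (username + ipAddress + secretKey).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to support SHA-256.
            throw new IllegalStateException(e);
        }
    }

    // Re-generates the id from the request's username/IP and compares it
    // to the presented id in constant time.
    static boolean verify(String presentedId, String username,
                          String ipAddress, String secretKey) {
        byte[] expected = sessionId(username, ipAddress, secretKey)
                .getBytes(StandardCharsets.UTF_8);
        return MessageDigest.isEqual(
                presentedId.getBytes(StandardCharsets.UTF_8), expected);
    }
}
```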
<p><strong>Note</strong>: If you use IP address to calculate session Id’s, keep in mind that the session Id will be invalidated when the IP address changes. This happens very frequently if your users are on a mobile network and are moving. If you are not sure, don’t use the IP address.</p>
<p>The <strong>advantage</strong> of this method is that the server doesn’t have to maintain state and store session Id’s in a database.</p>
<p><strong>Disclaimer:</strong> The Session Id generation formula above is simplistic. Real applications would combine many parameters such as the user’s access group, timestamp, etc. The timestamp is generated based on Session Id’s lifetime to allow it to expire.</p>
<p><strong>Note:</strong> There is a cool standard called <a href="https://jwt.io/">JSON Web Tokens</a> that allows the payload to carry the information. I haven’t used it but it looks promising.</p>
<h2 id="3-random-session-ids-with-a-predictable-part">3. Random session Id’s with a predictable part</h2>
<p>This is a slight modification of the Random session Id generation method. The session Id consists of both a random number and a hash combining some properties of the user such as the username and IP address.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sessionId</span> <span class="o">=</span> <span class="no">SHA2</span><span class="o">(</span><span class="n">userId</span> <span class="o">+</span> <span class="n">ipAddr</span><span class="o">)</span> <span class="o">+</span> <span class="n">prngRandomNumber</span>
</code></pre></div></div>
<p>The resulting session Id is stored in the session store and looked up for each request. I <em>feel</em> this is a little more secure than just using a (cryptographically secure) random number. Over-engineered? Maybe. But I’ll err on the side of caution.</p>
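<p>As a rough illustration (again, the names are my own), the hybrid id concatenates a SHA-256 hash of the user’s properties with a suffix from a CSPRNG. Note that the hash prefix is the same for every session of a given user and IP; only the random suffix is unpredictable:</p>

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;

public class HybridSessionId {
    private static final SecureRandom RANDOM = new SecureRandom();

    // SHA2(userId + ipAddr) + prngRandomNumber, per the formula above.
    // The result (64 hex chars of hash + 32 of randomness) goes into
    // the session store and is looked up on each request.
    static String newSessionId(String userId, String ipAddr) {
        byte[] random = new byte[16];
        RANDOM.nextBytes(random);
        return sha2Hex(userId + ipAddr) + toHex(random);
    }

    static String sha2Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return toHex(md.digest(input.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    static String toHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }
}
```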
<h2 id="security">Security</h2>
<p>Before I end this article, let’s briefly discuss security. Because <strong>session Id’s are usually portable</strong>, as a developer, you need to ensure that they are not easily obtainable by eavesdroppers or can be shared by mistake.</p>
<ol>
<li>Don’t send session Id’s unencrypted. Use HTTPS to encrypt all traffic end-to-end.</li>
<li>Don’t send session Id’s as a URL parameter: your users can inadvertently share URLs, thus revealing their session Id. Also, the session Id’s will appear in the web server or application logs and will be visible to anyone who has access to the logs.</li>
</ol>
<h2 id="bonus">Bonus</h2>
<p>Here is a small list of cryptographically secure random number generators in popular languages:</p>
<ol>
<li><a href="https://docs.oracle.com/javase/8/docs/api/java/security/SecureRandom.html">Java: SecureRandom</a></li>
<li><a href="https://msdn.microsoft.com/en-us/library/system.security.cryptography.rngcryptoserviceprovider(v=vs.110).aspx">.NET: RNGCryptoServiceProvider</a></li>
<li><a href="https://www.npmjs.com/package/uid-safe">Node.js</a></li>
<li><a href="http://linux.die.net/man/4/urandom">*nix: /dev/urandom</a></li>
</ol>
<p>Update: 3/11/2017 - Removed references to MD5 thanks to <a href="https://twitter.com/HollyGraceful">HollyGraceful</a> for calling it out.</p>
Developing Sense of Ownership in Employees - Let Your People 'Own' It!2016-04-12T00:00:00+00:00https://codeahoy.com/2016/04/12/let-them-own-it<p>To me, management is:</p>
<ul>
<li>about hiring the right people</li>
<li>telling them <em>what</em> needs to get done and <em>why</em>,</li>
<li>giving them the tools they need and</li>
<li><strong>getting out of their way</strong>.</li>
</ul>
<p><!--more-->
This might sound extreme. I don’t mean managers should leave their employees on their own without any supervision or accountability. Not at all.</p>
<p>Even the brightest people need support from time to time to put them back on the right track.</p>
<p>Managers are responsible for making sure their teams understand the <em>vision</em> and stay on track. For inviting them into the <strong>goal-setting</strong> process and listening to their feedback. For <strong>challenging</strong> their employees to perform at their best. For holding the team <strong>accountable</strong> for their actions.</p>
<p>The problem starts when managers get in the way of their employees, interfere with their ability to perform their duties, and later wonder why it is so hard to retain employees. They don’t let a sense of ownership develop in their employees because of their inability to <strong>delegate</strong> and their belief that they are responsible for all <strong>decision</strong> making.</p>
<p>Almost all organizational structures put managers at the top of the hierarchy implying that they are the ones responsible for all decision making. And most managers behave that way. <a href="http://www.amazon.com/Peopleware-Productive-Projects-Second-Edition/dp/0932633439">DeMarco and Lister</a> provide details of their encounter with one such “manager”:</p>
<blockquote>
<p>one senior manager we encountered at a professional society meeting in London. He summed up his entire view of the subject with this statement: “Management is kicking ass.” This equates to the view that managers provide all the thinking and the people underneath them just carry out their bidding. Again, that might be workable for cheeseburger production, but not for any effort for which people do the work with their heads rather than their hands. Everyone in such an environment has got to have the brain in gear. You may be able to kick people to make them active, but not to make them creative, inventive, and thoughtful.</p>
</blockquote>
<p><strong><a href="/2017/02/08/committing-teamicide-by-micromanagement/">Micromanagement</a></strong> doesn’t work and is not productive. Get the right people on the bus and let them take ownership. Check-in regularly to make sure everyone understand where the ship is headed. Remove barriers and move heaven and earth to give the team anything they need to succeed.</p>
<p>Bad managers who believe their job is to “<em>be an intimidating alpha male and to crack the whip</em>” will see the productivity and quality go down in any job that requires people to use their imagination and brain power. Then they become defensive and start putting “processes” and “measurements” in place. The slow decline begins. People abandon ownership or adopt the “<em>Meh. Why should I care?</em>” attitude. Good people leave (physically or mentally) and they end up with yes men. The project is doomed. If it doesn’t fail, it will be of mediocre quality.</p>
<p>In <a href="http://www.amazon.com/Peopleware-Productive-Projects-Second-Edition/dp/0932633439">Peopleware</a>, DeMarco and Lister make a great point:</p>
<blockquote>
<p>You take no steps to defend yourself from the people you’ve put into positions of trust. And all the people under you are in positions of trust. A person you can’t trust with any autonomy is of no use to you.</p>
</blockquote>
<p>Good managers understand that the path to success starts by hiring the right people and developing a sense of ownership in them.</p>
What Are Code Reviews and How to Do Them Effectively2016-04-03T00:00:00+00:00https://codeahoy.com/2016/04/03/effective-code-reviews<h2 id="what-is-code-review">What is Code Review?</h2>
<p>Code review is the process or rather an <em>activity</em> in which code written by a developer is inspected by <em>other</em> developers to look for defects and improvements. In other words, developers work on their code and ask for one of their peers to review their changes before they are merged into the main codebase.</p>
<p>In the last few years, code reviews have become part of the normal workflow for large and small teams, ensuring that every change gets looked at by at least one other person. They are an integral part of development at almost every large company - Microsoft, Google, and Amazon to name a few - where every line of code is reviewed and approved by developers before it is merged into the main codebase.</p>
<h2 id="what-to-look-for-in-code-reviews">What To Look For in Code Reviews</h2>
<p>It varies from team to team, but generally, a code reviewer should consider the following:</p>
<ul>
<li>Does the code meet the <em>requirements</em> that it’s addressing? Does it do what the developer intended it to do?</li>
<li>Does the overall <em>design</em> make sense and fit with the rest of the architecture?</li>
<li>Are there general <em>defects</em> like race conditions (for concurrent code), edge cases, and other bugs that users might encounter?</li>
<li>Is the code <em>readable</em> and <em>maintainable</em>? Are future developers likely to struggle to understand what’s going on? Is it more complex than it needs to be?</li>
<li>Does the code have appropriate <em>Test</em> coverage? (<a href="/2016/07/05/unit-integration-and-end-to-end-tests-finding-the-right-balance/">Unit, Integration, or End to End tests</a>)</li>
<li>Does the code adhere to your <a href="/2016/05/22/effective-coding-standards/">coding standards or style</a>? This includes things like naming, comments, etc. Note: Automate style-checking as much as possible.</li>
</ul>
<!--more-->
<h2 id="why-do-code-reviews">Why do Code Reviews?</h2>
<p>Code reviews take time: the change is held up until it is reviewed, usually by 2 people. So why do it?</p>
<p>When done correctly, code review is a proven technique that helps <strong>improve code quality</strong> and spread core <strong>knowledge</strong> across the team. It forces developers to hold themselves to a <strong>higher standard</strong> because they know that their code will be reviewed by their peers. It’s also a great tool for <strong>mentoring</strong> new team members on nuances of the code base. Or as <a href="https://www.amazon.com/Refactoring-Improving-Existing-Addison-Wesley-Technology-ebook/dp/B007WTFWJ6/" rel="nofollow">Martin Fowler</a> puts it:</p>
<blockquote>
<p>Code reviews help <strong>spread knowledge</strong> through a development team. Reviews help more experienced developers <strong>pass knowledge</strong> to less experienced people. They help more people understand more aspects of a large software system. They are also very important in writing <strong>clear code</strong>. My code may look clear to me, but not to my team. That’s inevitable–it’s very hard for people to put themselves in the shoes of someone unfamiliar with the things they are working on.</p>
</blockquote>
<p>So even though code reviews introduce extra time, it’s an <strong>excellent trade-off</strong> considering all the benefits that we get out of them. (You could even argue that they save time that’d be spent in future on bug fixes, direct knowledge sharing or paying technical debt as a result of unmaintainable code.) According to <a href="http://www.amazon.com/Code-Complete-Developer-Best-Practices-ebook/dp/B00JDMPOSY" rel="nofollow">Code Complete 2</a>, code reviews are very effective at detecting bugs and cites several case studies:</p>
<blockquote>
<ul>
<li>IBM’s 500,000 line Orbit project used 11 levels of inspections. It was delivered early and had only about <em>1 percent</em> of the errors that would normally be expected.</li>
<li>A study of an organization at AT&T with more than 200 people reported a 14 percent increase in productivity and a <em>90 percent decrease</em> in defects after the organization introduced reviews.</li>
<li>Jet Propulsion Laboratories estimates that it saves about $25,000 per inspection by finding and fixing defects at an early stage.</li>
</ul>
</blockquote>
<p>However, many teams struggle with effective code reviews. In fact, when done incorrectly, code reviews can be quite painful. In <strong>dysfunctional</strong> teams and organizations, it can quickly turn into a rather <strong>nasty</strong> experience for everyone involved:</p>
<ul>
<li>Code reviewers show off their skills - or sometimes even get back at the author - by demanding that their pointless opinions, which would make absolutely no difference to the outcome, be implemented.</li>
<li>Code reviews take a long time to complete, delaying feature releases and forcing the author to keep resolving merge conflicts.</li>
<li>Authors ignore review comments and get their ally to approve.</li>
<li>It creates friction between developers.</li>
</ul>
<h2 id="how-to-do-code-reviews">How to do Code Reviews</h2>
<p>Let’s look at some techniques that the authors of the pull request, reviewers and company management can use to do code reviews <em>effectively</em>.</p>
<h3 id="advice-to-management">Advice to Management</h3>
<p>Managers should make sure that everyone understands the goals and importance of code reviews. Unless code reviews are part of your culture, developers are not going to ask their peers to review their code.</p>
<p>Managers can also help by setting up the right tools and adapting the release workflow to make code reviews a mandatory activity. Source control management systems like GitHub and GitLab have built-in support for code reviews: they allow comments on specific lines of code, block merges until the required number of approvals has been granted, and so on. You can also use 3rd-party tools like Crucible from Atlassian.</p>
<p>Make sure developers have enough bandwidth in their Sprints to review code. Otherwise, code reviews can end up taking a long time to complete.</p>
<h3 id="everyone-remember-the-human">Everyone: Remember the Human</h3>
<p>In his book, <a href="http://www.amazon.com/Peer-Reviews-Software-Practical-Guide/dp/0201734850" rel="nofollow">Peer Reviews in Software: A Practical Guide</a>, Wiegers writes:</p>
<blockquote>
<p>The dynamics between the work product’s author and its reviewers are critical. The author must trust and respect the reviewers enough to be receptive to their comments. Similarly, the reviewers must show respect for the author’s talent and hard work. Reviewers should thoughtfully select the words they use to raise an issue, focusing on what they observed about the product. Saying, “I didn’t see where these variables were initialized” is likely to elicit a constructive response, whereas “You didn’t initialize these variables” might get the author’s hackles up.</p>
</blockquote>
<p>It is easy to become fixated on the code, but remember, there’s a human at the other end of the table (or computer). A human who has opinions. A human who is entitled to have an <em>ego</em>. Remember that there are many ways to solve a problem.</p>
<ul>
<li>Be humble. I have seen both highly productive reviews and very unproductive ones because someone decided to be a prick – don’t be a prick:-)</li>
<li>Make sure you have <strong>coding standards</strong> in place. Coding standards are a shared set of guidelines in an organization with buy-in from everyone. If you don’t have coding standards, then don’t let the discussion turn into a pissing contest over coding styles (opening braces ‘{‘ on the same line or the next!) If you run into a situation like that, take the discussion offline to your coding standards forum.</li>
<li>Learn to communicate well. You must be able to clearly express your ideas and reasons.</li>
<li>When it comes to dealing with <em>opinions</em>, reviewers and authors should seek to understand each other’s perspectives but shouldn’t get into a philosophical debate.</li>
</ul>
<h3 id="advice-to-reviewers">Advice to Reviewers</h3>
<ul>
<li>The author isn’t there to be a sitting duck. Remember, the purpose is not to demonstrate who the better programmer is: it is to find defects and ensure that the code is simple and maintainable (or whatever your objectives are).</li>
<li>Leave <strong>actionable</strong> feedback.</li>
<li>Ask questions when you are not sure about something. Don’t make demands or statements which could sound accusatory. For example, don’t say: “You didn’t use the XYZ library here”. A better way would be to genuinely seek to understand the developer’s perspective: “What do you think about library XYZ and if it applies here?”.</li>
<li>Avoid “<em>why did you</em>”, “<em>why did you not</em>” style questions when possible. It could put people on the defensive. “Why did you make this a global variable?” could be better expressed as “I don’t understand why this needs to be a global variable. Can you explain?”.</li>
<li>Use <strong>we</strong> instead of <em>you</em>.</li>
<li>If you are providing a suggestion, call it out as such. For example, instead of saying: “<em>make the color one shade more neutral</em>”, you should say: “<em>(suggestion/opinion) we should make the color one shade more neutral</em>”.</li>
<li>Review quickly and block time for code reviews.</li>
<li>If the author addressed your comments or did something <em>great</em> that you noticed, <strong>tell them</strong>! Most code reviews focus on finding mistakes but you should offer your <em>appreciation</em> and <strong>encouragement</strong> if the developer did something great. This type of feedback is super encouraging to developers and goes a long way!</li>
</ul>
<p>Some of these things won’t work if they come off as rehearsed or said in a sarcastic tone. Treat the code review as you would a normal conversation. You are listening to another person and should genuinely seek to understand their perspective. Offer suggestions and tips when they are necessary. If the code is great, don’t be compelled to find something negative to say about it.</p>
<h2 id="advice-to-authors">Advice to Authors</h2>
<ul>
<li>Don’t take things personally. Remember that the villains are the defects or inadequacies in the code, not you. Recognize that you may be attached to your code and that it is normal. If you take pride in your work, that’s a good sign that you are someone who cares about the craft. Have just the right amount of ego – enough to trust and defend your ideas, that is <em>the ability to negotiate</em>.</li>
<li>Don’t create mega-gigantic pull requests (10+ files changed). Ask for reviews often and keep your pull requests small. If you can’t avoid a large PR, give others a heads-up ahead of time (such as during Sprint Planning) and set up some time to walk through your changes and requirements.</li>
<li><em>Describe your change</em>: provide a complete description and link to the JIRA ticket so that reviewers can understand the requirements.</li>
<li>Add the <em>right reviewers</em>. You want to get the <strong>best feedback</strong>, so it is important that you add the right people. For example, if you are making a change in the React code, add people who are familiar not only with React, but also with that part of the code. If you are making a change in a microservice that’s owned by a different team, add them as reviewers. In some instances, you might even need to assign <em>different people</em> to different parts of the PR.</li>
<li>To err is human. The reviewer is acting as a second set of eyes and could point out things that you might have overlooked. Questions are as valuable as concrete advice.</li>
<li>Ask specific questions. “Does it make more sense to move all these classes into their own package?”</li>
<li>Respond to all feedback, whether you agree or disagree.</li>
</ul>
<h2 id="code-review-example">Code Review Example</h2>
<p>Modern tools like GitHub, GitLab, Crucible and others have made the code review process easier than ever. Let’s look at a <a href="https://github.com/neovim/neovim/pull/18832" rel="nofollow">good code review example</a> I found on GitHub. The developer made some changes and opened a pull request (PR). The first thing to notice is the detailed and to the point commit message by the author describing the change:</p>
<p><img src="/img/blogs/codereview-github-commit-message-1.png" class="rounded mx-auto d-block" alt="GitHub PR Commit Message for Code Review" /></p>
<p>Next, we see that the code reviewer suggests some changes, and explicitly calls out “good ideas” to <strong>encourage</strong> the contributor.</p>
<p><img src="/img/blogs/codereview-github-commit-message-2.png" class="rounded mx-auto d-block" alt="GitHub PR Comment for Code Review" />
<img src="/img/blogs/codereview-github-commit-message-3.png" class="rounded mx-auto d-block" alt="GitHub PR Comment for Code Review, Good Idea" /></p>
<p>The reviewer then <strong>approves</strong> the PR and leaves this final message:</p>
<blockquote>
<p>It’s great to see this fixed, thank you! Nice explanation + commit message too.</p>
<p>Another improvement we’ve considered is to debounce the scrollback management #8959 (or ideally put it in a ring buffer, but I’m not sure if that can work with libvterm).</p>
</blockquote>
<h2 id="summary">Summary</h2>
<p>Code review is an excellent technique for improving software quality when done right. Code reviews not only find defects but also help developers grow by getting them feedback on their code from other developers, spread knowledge of the codebase throughout the team, and are a useful tool for mentoring new team members. Code reviews are part of a healthy engineering culture.</p>
<p><em>Updated</em>: June 2022 to add Code Review example.</p>
<p class="message">
I would love to hear your feedback, comments, thoughts on conducting effective code reviews. Please leave a
comment below sharing your experience or anything that would add value to this article and its future readers.
</p>
<h2 id="checked-vs-unchecked-exceptions-in-java">Checked vs Unchecked Exceptions in Java. Why it's so Confusing (2016-04-02)</h2>
<p><a href="https://codeahoy.com/java/2016/04/02/checked-vs-unchecked-exceptions-in-java">https://codeahoy.com/java/2016/04/02/checked-vs-unchecked-exceptions-in-java</a></p>
<p>This blog post is intended for new Java developers. It starts with a historical perspective and a look at what motivated the design and creation of Java’s exception handling mechanism. It also wades into the hotly contested <strong>checked vs unchecked exceptions</strong> debate with some personal insights.</p>
<p>Let’s start.</p>
<!--more-->
<h2 id="historical-perspective">Historical Perspective</h2>
<p>Back in the time of the “C” programming language, it was customary to return values such as -1 or NULL from functions to indicate errors. This was practical for small applications but didn’t scale well for larger ones - developers had to check and track every possible return value: a return value of 2 might indicate a ”host is down” error in library A, whereas in library B it could mean ”illegal filename”. Developers tried to fix this by standardizing error codes through global variables, but it didn’t help much.</p>
<p><img src="https://codeahoy.com/img/blogs/errors_in_c.jpg" alt="C_Global_Errors" /></p>
<p>James Gosling and other designers felt that a similar approach would go against the design goals of Java programming language. They wanted:</p>
<ol>
<li>a cleaner, <strong>robust</strong> and portable approach</li>
<li><strong>built-in</strong> language support for error checking and handling.</li>
</ol>
<p>Essentially, one of their main design goals was to build a language that’s <strong>robust</strong> and able to cope with errors during execution, or at least <strong>recognize</strong> when things go wrong. This principle is at the core of Java’s error handling design, as we’ll see later. James Gosling explains in one of his <a href="https://www.artima.com/intv/solid.html">interviews</a>:</p>
<blockquote>
<p>One of the traditional things to screw up in C code is opening a data file to read. It’s semi-traditional in the C world to <strong>not check</strong> the return code, because you just know the file is there, right? So you just open the file and you read it. But someday months from now when your program is in deployment, some system administrator reconfigures files, and the file ends up in the wrong place. Your program goes to open the file. It’s not there, and the open call returns you an error code that you never check. You take this file descriptor and slap it into your file descriptor variable. The value happens to be -1, which isn’t very useful as a file descriptor, but it’s still an integer, right? So you’re still happily calling reads. And as far as you can tell, the world is all rosy, except the data just isn’t there.</p>
</blockquote>
<p>They didn’t have to look too far. The inspiration for handling errors came from a very popular language of the 60’s: LISP. Java’s exception handling was born and (for better or worse) the rest is history.</p>
<h2 id="exception-handling-in-java">Exception Handling in Java</h2>
<p>So what is exception handling? It is unconventional but a simple concept: if an error is encountered during program execution, halt the normal execution and transfer control to a section specified by the programmer. Let’s look at an example:</p>
<pre class="prettyprint lang-java">
try {
    // FileReader's constructor throws FileNotFoundException if the file is missing
    BufferedReader reader = new BufferedReader(new FileReader("list.txt"));
    String line = reader.readLine();
    reader.close();
} catch (FileNotFoundException fnfe) { // transfer control to this block on error.
    // do something with the error.
    // notify user or try reading another location, etc
} catch (IOException ioe) {
    // readLine() and close() can also fail with an IOException
}
</pre>
<p>In other words, exceptions are exceptional conditions which disrupt the normal program flow. Instead of executing the next instruction in the sequence, the control is transferred to the Java Virtual Machine (JVM) which tries to find an appropriate exception handler in the program and transfer control to it (hence disrupting the normal program flow). In the last example, if the file <code class="language-plaintext highlighter-rouge">list.txt</code> is not found, the control will be transferred to the <code class="language-plaintext highlighter-rouge">catch(...)</code> block instead of continuing to the next line in the <code class="language-plaintext highlighter-rouge">try</code> block. Here are some more examples where an exception can be thrown:</p>
<ul>
<li>Accessing index outside the bounds of an array</li>
<li>Disk is full</li>
<li>IP address for a host couldn’t be determined</li>
<li>Using a <code class="language-plaintext highlighter-rouge">null</code> value when an object is required</li>
<li>Divide by 0</li>
<li>Violation of defined contract: e.g. invalid values passed to a method</li>
</ul>
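<p>Several of the conditions above map directly to built-in exception types. A tiny sketch that triggers and catches three of them (the helper method names are invented for this example):</p>

```java
public class ExceptionExamples {
    static String causeOutOfBounds() {
        try {
            int[] a = new int[2];
            return String.valueOf(a[5]); // index outside the array's bounds
        } catch (ArrayIndexOutOfBoundsException e) {
            return "ArrayIndexOutOfBoundsException";
        }
    }

    static String causeNullPointer() {
        try {
            String s = null;
            return s.toUpperCase(); // using null where an object is required
        } catch (NullPointerException e) {
            return "NullPointerException";
        }
    }

    static String causeDivideByZero(int denominator) {
        try {
            return String.valueOf(10 / denominator); // divide by 0 throws
        } catch (ArithmeticException e) {
            return "ArithmeticException";
        }
    }

    public static void main(String[] args) {
        System.out.println(causeOutOfBounds());
        System.out.println(causeNullPointer());
        System.out.println(causeDivideByZero(0));
    }
}
```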
<h2 id="two-types-of-exceptions-in-java-checked-and-unchecked">Two Types of Exceptions in Java: Checked and Unchecked</h2>
<p>In Java, there are two types of exceptions: checked and unchecked. Let’s take a look at them.</p>
<h3 id="checked-exceptions">Checked Exceptions</h3>
<p>Checked exceptions are used to represent recoverable error conditions e.g. file not found. Java requires that these exceptions are explicitly handled by developers or the code won’t compile. According to <a href="http://docs.oracle.com/javase/tutorial/essential/exceptions/catchOrDeclare.html">official documentation</a>:</p>
<blockquote>
<p>These are exceptional conditions that a well-written application <strong>should anticipate and recover from</strong>. For example, suppose an application prompts a user for an input file name, [..] But sometimes the user supplies the name of a nonexistent file, and the constructor throws java.io.FileNotFoundException. A well-written program will catch this exception and notify the user of the mistake, possibly prompting for a corrected file name.</p>
</blockquote>
<p>Failing to catch (or declare) a checked exception results in a compile-time error:</p>
<pre class="prettyprint lang-java">
Main.java:8: Warning: Exception java.io.FileNotFoundException must be caught,
or it must be declared in throws clause of this method.
f = new FileInputStream(filename);
^
</pre>
<p>Developers have a few options on how to handle checked exceptions. All of these require an explicit acknowledgement of the exception and taking an action on it.</p>
<ol>
<li>catch and do something with the exception, or,</li>
<li>re-throw it and give the methods higher up in the call stack a chance to handle it, or,</li>
<li>catch the exception, wrap it up in a different exception and throw it up. This is called exception wrapping.</li>
</ol>
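<p>The three options above can be sketched in code. This is a minimal illustration using <code>FileReader</code>; the method names (<code>pickReadablePath</code>, <code>open</code>, <code>openOrWrap</code>) are invented for this example:</p>

```java
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;

public class HandlingOptions {

    // Option 1: catch the exception and handle it locally.
    static String pickReadablePath(String path) {
        try (FileReader r = new FileReader(path)) {
            return path;
        } catch (IOException e) {
            return "default.txt"; // fall back to a default file
        }
    }

    // Option 2: re-throw by declaring it, giving callers a chance to handle it.
    static FileReader open(String path) throws FileNotFoundException {
        return new FileReader(path);
    }

    // Option 3: wrap the checked exception in a different exception and throw that.
    static FileReader openOrWrap(String path) {
        try {
            return new FileReader(path);
        } catch (FileNotFoundException e) {
            throw new IllegalStateException("required file missing: " + path, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(pickReadablePath("no-such-file.txt"));
    }
}
```

<p>Note how option 3 passes the original exception as the <em>cause</em>, so the stack trace is preserved for anyone catching the wrapper.</p>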
<p>Since checked exceptions represent problems from which a program may wish to recover, the designers of Java wanted to <strong>force</strong> developers to at least pay attention so that these do not get accidentally ignored (and are handled as close to the source as possible). As James Gosling <a href="https://www.artima.com/intv/solid2.html">explained</a>:</p>
<blockquote>
<p>… because the knowledge of the situation is always fairly localized. When you try to open a file and it’s not there, your coping strategy is really determined by what you were going for. Some guy miles away isn’t going to know what to do. The code that tried to open the file knows what to do, whether it be trying a backup file, looking in a different directory, or asking the user for another filename.</p>
</blockquote>
<p>You can’t accidentally ignore a checked exception. If you do choose to ignore, you must do so explicitly.</p>
<pre class="prettyprint lang-java">
try {
    FileInputStream in = new FileInputStream("list.txt");
    // code which throws checked exceptions
    in.close();
} catch (Exception e) {
    // do nothing or "I don't care". Extremely bad practice.
}
</pre>
<p><strong>Beware</strong>: Do <em>NOT</em> do this in real-life. It’s an extremely bad practice. At the very least, log the exception.</p>
<h3 id="unchecked-exceptions">Unchecked exceptions</h3>
<p>The <a href="http://docs.oracle.com/javase/tutorial/essential/exceptions/catchOrDeclare.html">documentation</a> says:</p>
<blockquote>
<p>(unchecked exceptions) are exceptional conditions that are internal to the application, and that the application usually cannot anticipate or recover from. These usually indicate programming bugs, such as logic errors or improper use of an API.</p>
</blockquote>
<p>Unchecked exceptions (aka RuntimeExceptions) represent problems which happen during program execution, e.g. divide by 0, calling a method on a <code class="language-plaintext highlighter-rouge">null</code> object reference, etc. Unlike checked exceptions, Java <strong>doesn’t require that we catch</strong> unchecked exceptions and the compiler won’t complain. These can happen anywhere in a program, and our code would be littered with <code class="language-plaintext highlighter-rouge">try-catch</code> blocks if we were catching every RuntimeException. In real life, this would be the equivalent of putting diesel into a gasoline car. You broke the contract and shouldn’t have done it. A mistake was made. The car comes to a grinding halt. There’s nothing you can do to fix it and get to your destination. You must take the car to a dealer.</p>
<p><strong>Should you catch RuntimeExceptions?</strong></p>
<p>What’s the point of catching RuntimeExceptions if the condition is irrecoverable? These errors are usually preventable by fixing your code in the first place. For example, dividing a number by 0 generates a RuntimeException (<a href="https://docs.oracle.com/javase/8/docs/api/java/lang/ArithmeticException.html">ArithmeticException</a>). But you could avoid it by checking that the denominator is not zero, i.e. <code class="language-plaintext highlighter-rouge">denominator != 0</code>. If this condition is not true, halt further execution (and possibly throw an <a href="https://docs.oracle.com/javase/8/docs/api/java/lang/IllegalArgumentException.html">IllegalArgumentException</a>!)</p>
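<p>A minimal sketch of this guard-clause approach (the class and method names are invented for illustration):</p>

```java
public class SafeMath {
    // Guard clause: validate the argument up front instead of letting
    // ArithmeticException (an unchecked exception) escape at runtime.
    static int divide(int numerator, int denominator) {
        if (denominator == 0) {
            throw new IllegalArgumentException("denominator must not be zero");
        }
        return numerator / denominator;
    }

    public static void main(String[] args) {
        System.out.println(divide(10, 2)); // prints 5
    }
}
```

<p>The caller still gets an unchecked exception on bad input, but it is now an <code>IllegalArgumentException</code> with a clear message that points at the real bug.</p>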
<p>But there are situations when it’s alright to catch RuntimeExceptions, log the error and move on. A while ago, I worked on a high-throughput client that was processing thousands of transactions a second. If a transaction was malformed, the code would complain and throw a RuntimeException. To prevent that from happening, we checked every single transaction and ignored any malformed ones.</p>
<pre class="prettyprint lang-java">
boolean isMalformed(Transaction t) {
  // Check transaction and return true if it's malformed; false otherwise
}
</pre>
<p>When we looked at the metrics, it showed that malformed transactions were super rare. 99.99% of the transactions were good, yet we were wasting precious CPU cycles testing every single transaction (including String comparison on some of the fields). So we removed the <code class="language-plaintext highlighter-rouge">isMalformed(...)</code> method and let the code throw RuntimeException, log it and move on to the next transaction. The result was improved application performance.</p>
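<p>The original client's code isn't shown here, but the catch-log-continue pattern can be sketched like this, with <code>Integer.parseInt</code> standing in for the real transaction parsing:</p>

```java
import java.util.Arrays;
import java.util.List;
import java.util.logging.Logger;

public class TransactionProcessor {
    private static final Logger LOG = Logger.getLogger(TransactionProcessor.class.getName());

    // Stand-in for the real parsing: throws NumberFormatException
    // (a RuntimeException) on malformed input.
    static int parseAmount(String transaction) {
        return Integer.parseInt(transaction);
    }

    // Catch the RuntimeException per transaction, log it, and move on
    // to the next one instead of pre-validating every transaction.
    static int processAll(List<String> transactions) {
        int processed = 0;
        for (String t : transactions) {
            try {
                parseAmount(t);
                processed++;
            } catch (RuntimeException e) {
                LOG.warning("skipping malformed transaction: " + t);
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(processAll(Arrays.asList("100", "oops", "250"))); // prints 2
    }
}
```

<p>The trade-off only pays off because malformed transactions are rare; if most inputs were bad, throwing and catching on every one would be slower than the upfront check.</p>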
<h2 id="checked-vs-unchecked-exceptions">Checked vs Unchecked Exceptions</h2>
<p>Many people find the dichotomy between checked and unchecked exceptions confusing and counter-intuitive. The core argument is whether or not a language should <strong>force</strong> developers to catch exceptions. Other languages like C++ and the more modern C#, left out checked exceptions and only support unchecked exceptions. If you want to read up on the debate, you can visit this <a href="https://stackoverflow.com/questions/613954/the-case-against-checked-exceptions">Stack Overflow question</a>.</p>
<p>Both sides have equally compelling arguments. Personally, I use checked exception when coding in Java but only judiciously and where it makes sense. I use them so that callers of my code don’t <strong>accidentally ignore</strong> something which they shouldn’t. And this is how it’s supposed to be done. I have worked with code where developers who didn’t like checked exceptions were <strong>disguising</strong> them as unchecked (creating all exception subclasses from <code class="language-plaintext highlighter-rouge">java.lang.RuntimeException</code>.) This is dangerous because some users of their code or library (other Java developers) who are expecting to only handle checked exceptions may not pay attention to unchecked and ignore things which they shouldn’t. On the other extreme, there are developers who just don’t understand when to use which type and force others to catch ‘checked’ exceptions which they cannot recover from, causing pain and displeasure. A developer I knew would Google search for similar built-in exception types and blindly use or extend from what he found without stopping and thinking whether he should be using checked or unchecked. And that I feel is the root cause behind the confusion and the debate: <em>checked exceptions aren’t bad. People just don’t pay attention and choose unwittingly</em>.</p>
<p>If you are in the camp of developers who don’t like checked exceptions and wish they go away, you should know that it’s not going to happen any time soon. Oracle is still <a href="https://docs.oracle.com/javase/tutorial/essential/exceptions/runtime.html">promoting</a> the use of checked exceptions as is evident from any Java documentation. Joshua Bloch also supports the use of checked exceptions in his seminal book <a href="https://www.goodreads.com/book/show/105099.Effective_Java_Programming_Language_Guide">Effective Java</a>:</p>
<blockquote>
<p>Item 40: Use checked exceptions for recoverable conditions and runtime exceptions for programming errors. E.g. host is down, throw checked exception so the caller can either pass in a different host address or move on to something else.</p>
<p>Item 41: Avoid unnecessary use of checked exceptions.</p>
</blockquote>
<p>Good or bad, checked exceptions are here to stay, and we should use them as intended when programming in Java. Enough said. Let’s move on and take a look at the exception class hierarchy in Java.</p>
<h2 id="the-java-exception-class-hierarchy">The Java Exception Class Hierarchy</h2>
<p>All exceptions in Java have a common ancestor: <code class="language-plaintext highlighter-rouge">java.lang.Throwable</code>. The following base exception classes are most commonly used by Java developers:</p>
<h3 id="1-javalangexception-and-the-javalangruntimeexception">1. <code class="language-plaintext highlighter-rouge">java.lang.Exception</code> and the <code class="language-plaintext highlighter-rouge">java.lang.RuntimeException</code></h3>
<p>Any class extending from <code class="language-plaintext highlighter-rouge">java.lang.Exception</code> is classified as a checked exception and must be declared in a method’s <code class="language-plaintext highlighter-rouge">throws</code> clause. Likewise, any class extending from <code class="language-plaintext highlighter-rouge">java.lang.RuntimeException</code> is classified as an unchecked exception.</p>
<p>Confusingly enough, <code class="language-plaintext highlighter-rouge">java.lang.RuntimeException</code> is a child of <code class="language-plaintext highlighter-rouge">java.lang.Exception</code>, but it is classified as an unchecked exception. (A special case.)</p>
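<p>In practice, this means the choice between checked and unchecked is made simply by picking the parent class. A small sketch (the exception and class names here are invented for illustration):</p>

```java
// A checked exception: extends Exception, so callers must catch it
// or declare it in their own `throws` clause.
class InventoryFullException extends Exception {
    InventoryFullException(String message) { super(message); }
}

// An unchecked exception: extends RuntimeException, so the compiler
// forces no handling on callers.
class InvalidItemException extends RuntimeException {
    InvalidItemException(String message) { super(message); }
}

public class Inventory {
    private int count = 0;

    // The checked exception must appear in the throws clause.
    void add(String item) throws InventoryFullException {
        if (item == null || item.isEmpty()) {
            throw new InvalidItemException("item must not be empty"); // unchecked: a programming bug
        }
        if (count >= 100) {
            throw new InventoryFullException("inventory is full");    // checked: recoverable
        }
        count++;
    }
}
```

<p>Callers of <code>add</code> are forced by the compiler to deal with <code>InventoryFullException</code>, while <code>InvalidItemException</code> can propagate silently — exactly the checked/unchecked split described above.</p>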
<h3 id="2-javalangerror">2. <code class="language-plaintext highlighter-rouge">java.lang.Error</code></h3>
<p><code class="language-plaintext highlighter-rouge">Error</code> and its subclasses are the second category of unchecked exceptions (the first being <code class="language-plaintext highlighter-rouge">RuntimeException</code>). These are used to indicate serious or abnormal problems, e.g. the disk failing while your application is in the process of writing to it, or <a href="https://docs.oracle.com/javase/8/docs/api/java/lang/VirtualMachineError.html">VirtualMachineError</a>, etc. Your application cannot recover from these errors and, for practical purposes, you’d <strong>never have to catch or worry about this type</strong>.</p>
<p>Here’s a look at the Java Exception class hierarchy visually:</p>
<p><img src="https://codeahoy.com/img/blogs/throwable_hierarchy.png" alt="Throwable_exception_hierarchy" /></p>
<p>That’s all folks! Hope you found this useful.</p>
<p>Here’s a more recent talk on the subject by <a href="https://en.wikipedia.org/wiki/Elliotte_Rusty_Harold">Elliotte Rusty Harold</a>, author of several Java books. I think he completely nails the issue that checked exceptions are not bad, it is just that <strong>“Sun forgot to tell anybody how to use them!”</strong>.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/rJ-Ihh7RNao?start=418" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>