Brainiac Corner with Charity Majors

Posted by VividCortex on Feb 12, 2015 6:15:00 AM

The Brainiac Corner is a format where we talk with some of the smartest minds in the system, database, devops, and IT world. If you have opinions on pirates, or anything else related, please don’t hesitate to contact us. Today, we interview Charity Majors, the production engineering manager for Parse at Facebook. Prior to the acquisition, she built Parse’s infrastructure. She hates/loves/hates databases and loves whiskey.


How did you get from stork to brainiac (i.e. what do you do today and how did you get there)?

I manage a team of production engineers working on Parse at Facebook. We are responsible for the performance, reliability, scalability, and database operations for 500,000+ mobile apps.

I started out doing classical piano performance, but I fell in love with computers in college. Since then I’ve worked all over the stack – as an ops geek, code monkey, DBA, release engineer, etc. I spent several years working on Second Life, then experimented with a few startups before landing at Parse. I’ve spent the last three years working on Parse, and I’m really passionate about our product and solving these problems for mobile developers at scale.

Pretty much the only thing I’ve never done much of is networking. My goal for 2015 is to level up on IPv6.

What is in your group’s technology stack?

We are at the tail end of a massive 1.5+ year project to rewrite our entire stack from Ruby on Rails to Go. Yo – if you’re doing a startup and you think you might ever have a scaling problem, DO NOT USE RUBY. Moving from a fixed pool of Ruby HTTP workers to a thread-per-request model has improved our reliability by over an order of magnitude.

Besides that: we run on AWS, and we make extensive use of automation helpers like AutoScaling groups. Our application data is stored in MongoDB, our analytics run on Cassandra, and our push stack runs on Redis. We also use MySQL for certain types of developer data.

Our ops team is fueled by a reliable, redundant supply of top-shelf bourbon and single malt scotch.

Who would win in a fight between ninjas and pirates? Why?

I have no vested interest in the outcome of this battle, but I will happily sponsor the fight and set up an online betting exchange for all the nerds who really care. I’ll take 10% off the top and retire to Mexico.

Which is a more accurate state of the world, #monitoringsucks or #monitoringlove?

Uh, very few people actually love monitoring. Those people are precious beyond measure. The thing about monitoring is, the first 85% is easy, and the last 15% is asymptotically hard approaching impossible.

One of the most interesting things about being acquired by Facebook was getting instant access to the wealth of services they have for instrumentation, application counters, time series dbs, and complex visualization tools. We never could have built these ourselves as a startup because we didn’t have the resources. But these tools were transformational for us in understanding and achieving our reliability goals and tracing the performance as experienced by each individual customer app.

This makes me bullish on third-party monitoring services like VividCortex and New Relic. The startup world needs the ability to introspect deeply into performance and reliability just as much as the big companies do, but building that introspection in-house takes key engineering resources away from your core product.

In six words or less, what are the biggest challenges your organization faces?

Making backends magical for mobile developers.

What’s the best piece of advice you’ve ever received?

Just get started. Every big problem seems intractable until you start poking at it.

Also, run towards the terrifying parts. I love that shaky, nervous feeling of not knowing if I’m up to the challenge or capable of solving the problem set in front of me.

What principles guide your expertise in your given domain?

It’s so hard to boil good technical judgment down to a simple set of principles. Every decision is a calculation of resource allocation based on soooo many contextual variables and it mostly surfaces as intuition.

That said, there are a few principles that generally apply to running systems at scale. You have to treat your infrastructure as code, and design for failure. Everything you build will fail! Your systems should be as simple as possible and your solutions should be as reusable as possible. It’s better to have to maintain one solution for ten different problems than ten special snowflakey solutions, even if the one solution is only 80% as good as the optimized case.

You should never have to log in to a server (or VM or container). Any time you have to log in to a server, you have failed in some way.

What is your vision for system administration in 5 years?

In five years, I think software engineers will be getting paged more than operations engineers. “Systems administration” won’t exist in five years – it barely exists now. Developers need to keep getting better at doing traditional operational tasks like deploying their own code, developing their own monitoring and metrics, and owning their services from end to end. Developer happiness should be intimately tied to the health of their services.

Operations engineering is going to keep evolving as a specialized skill set related to building and running systems with high scalability and reliability needs. Ops will increasingly be focused on building fault-tolerant infrastructure for developer-owned services to run on, and helping developers architect reliability into their design.

Also? I think that over the next five years, most people will realize that they actually don’t have particularly hard scaling or reliability problems, and they should outsource their systems to platforms like Parse or Heroku until they do.
