- Stories about lessons learned building matchmaking, leaderboards, and large, low-latency systems for Activision/Blizzard
- "Empathy is a fundamental engineering skill"
- You can't find all the bugs
- #hugops
- slides from mattsmillie on Twitter
- To Read: bit.ly/AllspawThesis
- Plus https://hfeinpractice.wordpress.com/2016/04/01/chapter-25-human-factors-and-ergonomics-practice-in-web-engineering-and-operations-navigating-a-critical-yet-opaque-sea-of-automation/
- Plus read Google's postmortem of their recent outage: https://status.cloud.google.com/incident/compute/16007?post-mortem
- A really good talk on Mobify building up their analytics and BI tools using RDS.
- "Agile design is not intelligent design": it got them a powerful service quickly, but no one in their right mind would ever design it this way. ;)
- To apply: Brendan Gregg's USE method when debugging: https://twitter.com/AlexJHammel/status/721033862453682176
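A minimal Python sketch of the USE checklist: for every resource, check Utilization, Saturation, and Errors. The ResourceMetrics type, the metric units, and the 70% utilization threshold are assumptions for illustration; the method itself doesn't dictate how you collect the numbers.

```python
from dataclasses import dataclass


@dataclass
class ResourceMetrics:
    """Hypothetical per-resource snapshot; collecting these is up to you."""
    name: str
    utilization: float  # fraction of the interval the resource was busy, 0.0-1.0
    saturation: float   # queued work it couldn't service yet (e.g. run-queue length)
    errors: int         # error events counted over the same interval


def use_check(resources, util_threshold=0.7):
    """Walk every resource and report any Utilization, Saturation, or Errors issue."""
    findings = []
    for r in resources:
        if r.errors > 0:
            findings.append(f"{r.name}: {r.errors} errors")
        # Any queued work means the resource is saturated.
        if r.saturation > 0:
            findings.append(f"{r.name}: saturated (queue={r.saturation})")
        # High utilization is a warning sign even before queueing starts.
        if r.utilization >= util_threshold:
            findings.append(f"{r.name}: {r.utilization:.0%} utilized")
    return findings


if __name__ == "__main__":
    snapshot = [
        ResourceMetrics("cpu", utilization=0.95, saturation=4, errors=0),
        ResourceMetrics("disk0", utilization=0.30, saturation=0, errors=2),
        ResourceMetrics("eth0", utilization=0.10, saturation=0, errors=0),
    ]
    for finding in use_check(snapshot):
        print(finding)
```

The point of the method is the exhaustive walk: every resource gets all three questions, so nothing is skipped just because it didn't look suspicious.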
Lunch
- ideas for implementing positive change in your org
- It doesn't know your existing infrastructure
- terraforming (a Ruby gem) can read your existing infrastructure, like ansible-blueprint
- Built "locking" via Jenkins so that pieces have a state. Trying to DIY Atlas to stay cheap.
- Popular Docker build service recommendations: Buildkite, CircleCI
- tangent: Shopify uses ejson for secrets
Day 2
- on-site visits are good.
- faces are important
- people are made to be empathetic
- Slides: https://twitter.com/davemangot/status/721434173332791297
- Random link: http://continuousdelivery.com/2016/04/the-flaw-at-the-heart-of-bimodal-it/
- awesome pictures
- docker, pytest
- unit tests
- system tests find bugs but are complex and heavy to set up
- keep provisioning fast to keep quality high: faster provisioning => more iterations => better quality
- Slides: https://docs.google.com/presentation/d/1RoxFXCOOnXOZZzOivV7mqTs5-B-pxgKCkvX2flianvc/edit#slide=id.p
- github.com/keeppythonweird/pytest-dockerpy
- "Tests increase confidence, test failures must be informative and actionable or they get ignored" (may have been said in this talk or the next one)
- "Failure Friday fosters a culture of handling failure" at PagerDuty
- "Running a service through the Failure Friday gauntlet is a great addition to the release process for new services"
Overview:
- improved resilience by testing in production
- reliability: PagerDuty can't go down because it's an important backup, so it has to be very resilient
- canary deploy
- bad deploys causing outages
- postmortem noticed
- they use GoCD by ThoughtWorks
- Do "end to end provider testing"
- They use Twilio, Plivo, and Tropo, with weighted selection between SMS providers
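A rough sketch of weighted provider selection, assuming each provider gets a fixed positive integer weight. The weights, the provider keys, and the fallback ordering are hypothetical; the notes only say that selection is weighted.

```python
import random

# Hypothetical weights; the talk didn't give real numbers.
PROVIDER_WEIGHTS = {"twilio": 60, "plivo": 30, "tropo": 10}


def pick_provider(weights, rng=random):
    """Pick one SMS provider with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def send_order(weights, rng=random):
    """Order every provider for one message: a weighted pick first, then the
    remaining providers (also drawn by weight) as fallbacks if a send fails."""
    remaining = dict(weights)
    order = []
    while remaining:
        choice = pick_provider(remaining, rng)
        order.append(choice)
        del remaining[choice]
    return order


if __name__ == "__main__":
    rng = random.Random(7)
    print(send_order(PROVIDER_WEIGHTS, rng))
```

Weighting spreads traffic across providers in the chosen proportions, while the fallback order means a single provider outage doesn't stop messages from going out.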
They test in production:
- wrote 100 tests/watchdogs/health checks that run in production
- short tests run every 5, long tests every 30
- examples of issues found: broken API compatibility, a slow queue, HTTP load balancer breakage, and transient failures
- the tests also generate load
Failure Friday:
- inject failures into the system
- gather the company; kind of like a hackathon
- watch the Datadog dashboard
- stop the service
- restart hosts
- isolate hosts from the network with iptables
- add fake latency with tc qdisc add (on eth0)
- keep a #FF chat channel: log the commands run and what happened, post graphs into the log, and track TODOs to fix later
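A sketch of the latency-injection step, assuming the tc call used the netem qdisc (the notes don't record the exact command, interface, or delay). It builds the commands, logs each one in the spirit of the #FF channel, and always removes the qdisc afterward. Running it for real needs root and iproute2; with dry_run=True it only logs.

```python
import contextlib
import subprocess


def netem_delay_cmd(interface, delay_ms):
    """argv that delays every packet leaving `interface` by `delay_ms`."""
    return ["tc", "qdisc", "add", "dev", interface, "root", "netem", "delay", f"{delay_ms}ms"]


def netem_clear_cmd(interface):
    """argv that deletes the root qdisc again, restoring normal latency."""
    return ["tc", "qdisc", "del", "dev", interface, "root"]


def _run(cmd, dry_run, log):
    log("$ " + " ".join(cmd))  # record every command that was (or would be) run
    if not dry_run:
        subprocess.run(cmd, check=True)


@contextlib.contextmanager
def injected_latency(interface="eth0", delay_ms=100, dry_run=True, log=print):
    """Add artificial latency for the duration of the `with` block."""
    _run(netem_delay_cmd(interface, delay_ms), dry_run, log)
    try:
        yield
    finally:
        # Always clean up, even if the experiment inside the block raises.
        _run(netem_clear_cmd(interface), dry_run, log)


if __name__ == "__main__":
    with injected_latency("eth0", delay_ms=250):
        print("watch the dashboards now")
```

If a root qdisc has already been added to the interface, tc qdisc add typically fails; tc qdisc replace is the usual alternative in that case.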
- Terraform plan
- run it continuously to catch state drift events
- Run load tests often
- built a Lambda function from CloudFormation that creates CloudFormation stacks
- keep it small, keep things fast
- They make GoCD
- Wrote some books on it
- Wanted to look at alternatives to their own tools
- Evaluated a large batch of tools and narrowed it down to Travis vs. GoCD
- Many people have been burned by past experiences with CI tools like Jenkins
- newness is hotness, but shiny ain't always a good thing
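For the note about running terraform plan continuously to catch state drift, a minimal sketch that relies on plan's -detailed-exitcode flag (exit 0 means no changes, 1 an error, 2 changes present). The function names, the 15-minute interval, and the alert hook are hypothetical; a cron job or CI schedule would work just as well as the loop.

```python
import subprocess
import time


def classify_plan_exit(code):
    """Map a `terraform plan -detailed-exitcode` status to a verdict."""
    if code == 0:
        return "in-sync"  # empty diff: real infrastructure matches the config
    if code == 2:
        return "drift"    # non-empty diff: something changed outside Terraform (or unapplied config)
    return "error"        # 1 (or anything else): the plan itself failed


def check_drift(workdir):
    """Run one non-interactive plan in `workdir` and classify the result."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode), result.stdout + result.stderr


def watch_drift(workdir, interval_s=900, alert=print):
    """Hypothetical loop: re-plan every `interval_s` seconds, alert on anything but in-sync."""
    while True:
        verdict, output = check_drift(workdir)
        if verdict != "in-sync":
            alert(f"terraform plan in {workdir}: {verdict}\n{output}")
        time.sleep(interval_s)
```

Treating errors as alert-worthy too keeps a broken plan from silently hiding drift.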
Lunch
- dream: what the future of connected devices might look like
- IoT, embracing DevOps culture, and deployment toolchains
- single page app optimization
- "Good point about frameworks being heavy b/c they have something for everyone." @mtomwing
- talked with folks experimenting with deployment and monitoring in their chatrooms
- Hootsuite, Shopify, Samsung