- Form a mental model of the process taking place. Write it down in words or draw a diagram. Validate that each step in your model is accurate, not just the one you think is failing.
- Another way to approach the above point - think of all the things that affect the running code, and work out which are the most likely to have an effect. Don't rule out the unexpected ones though. Think how you could test your understanding of, or rule out a problem with, each of the aspects.
  - your code (and different versions thereof)
  - other libraries (and different versions thereof)
  - application server (and different versions thereof)
  - language interpreter, compiler and/or virtual machine (and different versions thereof)
  - properties files
  - environment variables
  - registry settings
  - native libraries on your operating system (and different versions thereof)
  - your operating system (and different versions thereof)
  - your file system
  - your terminal or shell
  - your network
  - your container (and different versions thereof)
  - caches
  - time/timezones
  - user privileges
  - file permissions
  - external dependency state (eg. database)
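One way to test your understanding of several of these aspects at once is to capture them on a working machine and a failing machine and diff the results. The following is a minimal sketch, assuming Python is available on both machines; the function name `environment_snapshot` and the default list of variable prefixes are illustrative choices, not part of any standard tool.

```python
import os
import platform
import sys
import time


def environment_snapshot(env_prefixes=("PATH", "JAVA", "PYTHON", "LANG", "TZ")):
    """Collect facts that commonly differ between a working and a failing machine."""
    return {
        "interpreter": sys.version.split()[0],
        "executable": sys.executable,
        "os": platform.platform(),
        "cwd": os.getcwd(),
        "timezone": time.tzname,
        "user": os.environ.get("USER") or os.environ.get("USERNAME"),
        # Only variables with the chosen prefixes (an assumption; adjust to your
        # stack), which keeps the output diff-friendly and avoids dumping secrets.
        "env": {
            key: value
            for key, value in sorted(os.environ.items())
            if key.startswith(env_prefixes)
        },
    }


if __name__ == "__main__":
    for key, value in environment_snapshot().items():
        print(f"{key}: {value}")
```

Save the output from each machine to a file and compare the two files with your usual diff tool.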
- Create the shortest feedback loop you can to replicate the issue. Got an SSL configuration issue in your app that takes a minute to start up each time? Try creating a "Hello world" app that boots in a few seconds and try to replicate the issue on that. If you can't recreate the issue in the mini environment, then at the very least, you've ruled out some potential sources of error.
- Creating a "shortest loop" example may involve creating a new codebase/project from scratch, which can be good if you want to share it in a git repo to get help or raise an issue. It also helps rule out interference from other dependencies or environmental issues. You could google for a project generator or example codebase to start with.
- Start with a working example (eg. copied from the internet, from another project, from a generated artifact/template, or from the same codebase at a previous point in time) and modify it one step at a time to replicate your situation. Keep the steps small and verify that it works as expected after every change. Even if the setup is completely different to your end goal (eg. using a docker compose file as your working example when your actual problem relates to deploying to a cloud service), it can help validate your understanding of the situation and the relationships between the various components.
- When you try different approaches, make sure you've cleaned up completely from your previous approach before starting the new one.
- Change one thing at a time.
- Google it
- Stackoverflow it
- Actually read the documentation. Make sure it's the documentation for the right version.
- Go back and really really read the error. Properly process each word.
- Read the logs. All the logs. From start to finish.
- Turn debug logging on
- Use a debugger. Stop and take the time to do a tutorial on using a debugger if you haven't used one before.
- Use good old print statements within your code.
- Upgrade to the latest version
- Rubber duck debugging (talking an inanimate object through the problem).
- Look at the git commit log/git blame
- Try git bisect to help identify which commit introduced a particular issue into a codebase. You can even automate this.
- Start again from scratch with a fresh workspace (eg. check out the project again and set it up from the beginning)
- Turn it off and on again. Make sure it's really turned off (eg. docker containers/volumes need to be removed, not just stopped)
- Take a break. Go do some exercise. Go home if it's after home time. The number of times a solution has occurred to me before I've left the building on my way out makes this one of my top tips.
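For the `git bisect` tip in the list above, automation works through `git bisect run`, which runs a command on each candidate commit and reads its exit code: 0 means good, 1 to 127 (except 125) means bad, 125 means "cannot test this commit, skip it", and anything else aborts the bisect. Below is a minimal sketch of such a check script. The file name `bisect_check.py`, the idea of detecting the bug by a marker string in the output, and the example command are all assumptions for illustration.

```python
import subprocess
import sys

# Exit codes that `git bisect run` understands.
GOOD, BAD, SKIP, ABORT = 0, 1, 125, 128


def classify(completed_ok: bool, output: str, bad_marker: str) -> int:
    """Decide what to tell git about the currently checked-out commit."""
    if not completed_ok:
        # The command failed outright (eg. the build is broken at this commit),
        # so this commit cannot tell us anything about the bug.
        return SKIP
    return BAD if bad_marker in output else GOOD


# Usage: python3 bisect_check.py <bad_marker> <command> [args...]
# The guard on argv means the script does nothing when run without arguments.
if __name__ == "__main__" and len(sys.argv) > 2:
    marker, command = sys.argv[1], sys.argv[2:]
    try:
        result = subprocess.run(command, capture_output=True, text=True)
    except OSError:
        # The command itself could not be started: stop the whole bisect.
        sys.exit(ABORT)
    sys.exit(classify(result.returncode == 0, result.stdout + result.stderr, marker))
```

With a known bad and a known good commit, this could be driven with `git bisect start <bad> <good>` followed by `git bisect run python3 bisect_check.py "some error text" ./run_the_thing.sh`, where the error text and `./run_the_thing.sh` are placeholders for your own reproduction.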
- Check the environment variables (and all the other places that properties could come from).
- Are there logs in /var/log/... or /var/log/syslog?
- Is anything cached? Check all the things that could possibly cache
- Does it need more memory?
- Does it happen on someone else's machine?
- Is it a timezone/date issue?
- Is there a virus checker involved?
- Is there anything asynchronous happening?
- Is there something hitting a timeout limit?
- Is there a race condition?
- Is there a buffer that needs flushing (eg. stdout can be buffered or async by default in some languages)?
- Is there a state leak? Try cleaning up before a process/step, rather than after. If you clean up at the end, it's tempting to optimise the procedure to just clean up the things you know about, and very easy for something to get missed as time progresses and more code gets added. If there is an error in your process/step, it's possible that the cleanup does not run properly either. If you clean up at the start of a process, you're starting with the assumption that you don't know what is there, so you're more likely to cover everything to get it back to a known state.
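The "clean up before, not after" idea can be sketched as follows. This is a minimal illustration, assuming the step keeps all of its state in a single working directory; `reset_workdir` and `run_step` are hypothetical names.

```python
import shutil
import tempfile
from pathlib import Path


def reset_workdir(path: Path) -> Path:
    """Return `path` as an empty directory, whatever was there before."""
    # Remove the whole tree rather than a list of known files, so leftovers
    # from a crashed or older run cannot leak into this one.
    shutil.rmtree(path, ignore_errors=True)
    path.mkdir(parents=True)
    return path


def run_step(workdir: Path) -> None:
    workdir = reset_workdir(workdir)  # clean up first, not last
    (workdir / "output.txt").write_text("result\n")
    # Deliberately no cleanup here: the output stays available for inspection,
    # and the next run will reset the directory anyway.


if __name__ == "__main__":
    base = Path(tempfile.mkdtemp())
    run_step(base / "work")
    print(sorted(p.name for p in (base / "work").iterdir()))
```

A side benefit of this design is that after a failure, the files from the failed run are still on disk when you come to debug it.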
- For bash and similar, try adding `set -x` to the script (or running it in your shell) to view every command.
- Setting `set -Eeuxo pipefail` will make a bash script safer and behave more like a normal programming language. The extra checks may help identify the reason for the issue (be aware that pipefail will change the behaviour of `command_1 | command_2` type commands, so make sure you understand this before using it).
- Have you fallen into one of the common bash pitfalls?
- Is there a `.` or `*` acting as a wildcard?
- Are your environment variables exported or just set for the current context?
- Are the parameters for the command in the right order?
- Have you set your shell at the top of the script? eg. `#!/bin/sh` or `#!/bin/bash`
- Is it a file permissions issue?
- Different operating systems sometimes have different implementations of common tools, and some have different options. Is there an operating system difference?
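The question above about exported versus merely set environment variables can be checked empirically. The sketch below assumes a POSIX `sh` and the `env` utility are on the path; the helper `visible_to_child` and the variable name `DEMO_VAR` are made up for the demonstration. It runs a snippet of shell, then asks a child process (`env`) whether it can see the variable.

```python
import os
import subprocess


def visible_to_child(shell_snippet: str, name: str) -> bool:
    """Run `shell_snippet` in sh, then report whether a child process sees `name`."""
    # Start from an environment that does not already contain the variable.
    clean_env = {k: v for k, v in os.environ.items() if k != name}
    # `env` runs as a separate child process, so it only lists exported variables.
    result = subprocess.run(
        ["sh", "-c", f"{shell_snippet}; env"],
        capture_output=True,
        text=True,
        check=True,
        env=clean_env,
    )
    return any(line.startswith(f"{name}=") for line in result.stdout.splitlines())


if __name__ == "__main__":
    # Set only in the current shell: the child does not see it.
    print(visible_to_child("DEMO_VAR=1", "DEMO_VAR"))
    # Exported: the child inherits it.
    print(visible_to_child("export DEMO_VAR=1", "DEMO_VAR"))
```

The same distinction explains why a variable that shows up with `echo $VAR` in your terminal can still be missing inside the program you launch from it.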
- Can you ping the host or telnet to the host/port?
- Is there a network firewall?
- Is there a personal firewall on your local machine?
- Is there a corporate proxy?
- Is there a cloud security policy (eg. AWS IAM resource policy)?
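If `telnet` is not installed where you need it, the host/port check from the list above can be approximated with a few lines of Python. This is a rough sketch, not a diagnostic tool: the function name `can_connect` is hypothetical, and the explanations attached to each failure are common causes rather than certainties.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> tuple[bool, str]:
    """Try a TCP connection, roughly what `telnet host port` checks."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except socket.gaierror as exc:
        return False, f"name lookup failed: {exc}"
    except (socket.timeout, TimeoutError):
        # Often, but not always, a firewall silently dropping packets.
        return False, "timed out"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return False, "connection refused"
    except OSError as exc:
        return False, f"other network error: {exc}"


if __name__ == "__main__":
    # Host and port are placeholders; substitute the service you are debugging.
    print(can_connect("localhost", 8080, timeout=1.0))
```

A successful TCP connection only shows that the port is reachable. A proxy, TLS problem or cloud policy can still reject the actual request.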
- Promises
- Promises
- Try using eslint
- Promises
- Timeouts
- New VM instances don't inherit properties from parent ones the way you'd expect.
- Step 1. Make it manually reproducible on demand
- Step 2. Write a high level failing test that exposes the bug
- Step 3. Drill down until you find the root cause
- Step 4. Write a unit test for the root cause
- Step 5. Fix it.
- Step 6. Run the high level test.
- Step 7. Delete the high level test unless it is really required permanently.
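As an illustration of steps 2 to 6, here is a small sketch built around an invented bug: an invoice total that crashed on amounts with a thousands separator. Every name in it (`parse_amount`, `invoice_total`, the test functions) is hypothetical, and the code is shown in its fixed state, after step 5.

```python
def parse_amount(text: str) -> float:
    """Root cause lived here: the buggy version called float(text) directly,
    which raises ValueError for input such as "1,250.00"."""
    return float(text.replace(",", ""))


def invoice_total(amounts: list[str]) -> float:
    return sum(parse_amount(amount) for amount in amounts)


def test_invoice_with_large_amount():
    # Step 2: high level test reproducing the symptom the user reported.
    assert invoice_total(["1,250.00", "10.50"]) == 1260.5


def test_parse_amount_handles_thousands_separator():
    # Step 4: unit test pinned to the root cause found in step 3.
    assert parse_amount("1,250.00") == 1250.0


if __name__ == "__main__":
    test_parse_amount_handles_thousands_separator()
    test_invoice_with_large_amount()  # Step 6: re-run the high level test
    print("all tests pass")
```

Both tests fail before the fix and pass after it. The unit test is the one worth keeping, because it is cheap to run and points straight at the cause if the bug returns.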
Write out your problem so that you can ask a real person an effective question. People are much more willing to help out if you can show that you've put in your own effort first, and have respect for their time.
- Write a brief description of what you're trying to achieve, what your expected outcome is, and what the actual behaviour is. Concisely describe your overall goal, not just the problem you're having. Sometimes the problem only exists because you're approaching the goal the wrong way, and if you communicate the goal, someone might be able to point you to a better solution. This helps avoid the XY problem.
- The relevant technologies and versions you are using.
- Details of the environment you're using (Mac/Windows/Linux, Docker).
- Example code.
- Any relevant logs.
This sounds like a lot of work, but the process of putting all the information together can sometimes help you solve the problem, and it shows respect for the time of the person you're hoping will help you.