Foundations: programming in Psychological science
What does "programming" mean in our field?
“Programming” or “coding” usually refers to preparing a set of instructions for a computer to execute. Technically, “coding” is the narrower term, where you write instructions to solve a concrete problem (e.g., create a bar plot), whereas “programming” is usually used for a larger project that involves several “coding” problems (e.g., analysing my data). In the rest of the book, we will use the two terms interchangeably, as this distinction rarely applies to our field. In Psychological research, programming is mostly used to prepare a set of operations that are executed repeatedly. Think, for example, of a situation where you want to analyse the data you obtained from an experiment. An experiment typically involves testing several participants, so you would need to apply the same set of operations several times, once for each participant. This (tedious) problem can easily be solved by programming a set of instructions that tell the computer to: 1) load the participant data, 2) remove unwanted variables, 3) check data quality and 4) compute mean performance. Once these instructions are written in a language that the computer can understand, it can repeat them as many times as needed with only one mouse click.
If you have done this before with software that relies on interactions via an interface (e.g., Excel, OpenOffice, etc.), you will know that performing these operations requires many steps involving multiple windows, menus and tabs. Worse still, these many steps need to be repeated manually for each new participant, which is tedious and very error-prone (if you have heard about Macros in Excel, see the section “Couldn’t I just do all this in Excel?” below). Those of us writing this book all started our careers in science using interface-based software, and we have all experienced the cost of making mistakes (e.g., deleting the wrong column in Excel, or sorting one column while forgetting to sort the rest of the variables).
Some of these mistakes can be caught early, whereas others only become apparent after you have chained further steps on top of them, which forces you to go back to the start and redo everything. Moreover, these mistakes are not easy to spot because of another problem characteristic of interface-based software: it does not keep a record of what was done, which tabs you opened or where you clicked. This makes it very hard to trace back where the error came from. In contrast, when you write a script that performs all these individual steps, this piece of code acts as a record of the operations applied to the data so, if you make a mistake in one step, you (or someone else looking at your code) can easily identify it. This takes us to the last immediate benefit of solving problems via programming: your steps are reproducible by others. You will learn more about reproducibility in chapter 6, but suffice to say here that your code can be verified, curated and validated by other people to ensure you have done what you intended to do. Bonus tip: these “other people” can include yourself after two months or two years, because solving your problems via programming will help your future self know what was done. This might sound trivial to you (as it did to many of us), but we cannot stress enough how important it is to be able to understand your own previous work, and it is a major challenge that we all face.
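To make the four steps listed earlier (load, remove unwanted variables, check quality, compute mean performance) more concrete, here is a minimal sketch in Python. Everything specific in it is an assumption made for illustration: the participant labels, the column names (`rt_ms`, `accuracy`, `notes`), the tiny made-up data set and the 100 ms quality threshold are all hypothetical, and the data are stored as text inside the script only so that the example runs on its own. In a real project each participant would have a file on disk.

```python
import csv
import io
from statistics import mean

# Hypothetical raw data for two participants (made up for this example).
RAW_FILES = {
    "sub-01": "trial,rt_ms,accuracy,notes\n1,512,1,ok\n2,478,1,ok\n3,10,0,misclick\n4,530,1,ok\n",
    "sub-02": "trial,rt_ms,accuracy,notes\n1,601,1,ok\n2,590,0,ok\n3,615,1,ok\n",
}

KEEP = ("trial", "rt_ms", "accuracy")  # variables we want to keep
MIN_RT_MS = 100                        # assumed quality criterion, not a standard


def process_participant(raw_text):
    """Apply the same four steps to the data of one participant."""
    # 1) Load the participant data.
    rows = list(csv.DictReader(io.StringIO(raw_text)))
    # 2) Remove unwanted variables (here, the free-text "notes" column).
    rows = [{name: row[name] for name in KEEP} for row in rows]
    # 3) Check data quality: drop trials with implausibly fast responses.
    valid = [row for row in rows if float(row["rt_ms"]) >= MIN_RT_MS]
    # 4) Compute mean performance over the remaining trials.
    return {
        "n_trials": len(rows),
        "n_valid": len(valid),
        "mean_rt_ms": mean(float(row["rt_ms"]) for row in valid),
        "mean_accuracy": mean(int(row["accuracy"]) for row in valid),
    }


# The same instructions are repeated for every participant, with no manual steps.
results = {pid: process_participant(text) for pid, text in RAW_FILES.items()}
for pid, summary in results.items():
    print(pid, summary)
```

Note that the script doubles as the record of what was done: the exclusion criterion and the list of retained variables are written down in one place, so you or a collaborator can check them later.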
Programming will be beneficial regardless of the branch of psychological research that you choose. Let’s say that you choose Basic Psychology and will be running experiments with multiple trials, variables and participants: you will need to present participants with a set of stimuli in repetitive trials, record their responses on every trial and save their performance to a file. You could also opt for Social Psychology, where you will have access to massive databases with demographic data about people, cities or countries: you will need to select the variables you are interested in, filter out cases that do not fulfill your inclusion criteria, sort by a given characteristic and examine relationships across variables. Perhaps you choose to pursue a career in Health Psychology and want to relate genetic markers to physiological variables: you will need to combine databases of genomic information with databases of patients’ test scores, re-code raw variables and compute indices across them so that you can test for correlations. The list of examples could continue, and the exact operations that you perform will differ depending on your field, but one thing will be common: you will be saving time and avoiding mistakes, which will, in turn, spare you lots of frustration and improve your science.
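As a small illustration of the select, filter and sort operations mentioned for large demographic databases, the following Python sketch works on a handful of made-up records. The variable names (`age`, `wellbeing`, `shoe_size`), the values and the inclusion criterion (adults only) are hypothetical and chosen only for this example.

```python
from operator import itemgetter

# Hypothetical survey records (made up for this example).
respondents = [
    {"id": "r1", "age": 17, "country": "ES", "wellbeing": 6.1, "shoe_size": 38},
    {"id": "r2", "age": 34, "country": "ES", "wellbeing": 7.4, "shoe_size": 42},
    {"id": "r3", "age": 52, "country": "PT", "wellbeing": 5.9, "shoe_size": 40},
    {"id": "r4", "age": 29, "country": "PT", "wellbeing": 8.0, "shoe_size": 44},
]

VARIABLES = ("id", "age", "wellbeing")  # variables of interest
MIN_AGE = 18                            # assumed inclusion criterion

# Select only the variables we are interested in.
selected = [{name: r[name] for name in VARIABLES} for r in respondents]
# Filter out cases that do not meet the inclusion criterion.
included = [r for r in selected if r["age"] >= MIN_AGE]
# Sort the remaining cases by a given characteristic (highest wellbeing first).
ordered = sorted(included, key=itemgetter("wellbeing"), reverse=True)

for r in ordered:
    print(r)
```

With four records this is trivial to do by hand, but the same three lines of logic would apply unchanged to a database with millions of rows.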
Couldn’t I just do all this in Excel?
Microsoft’s Excel is a great tool for inspecting data when you encounter it for the first time. It is also very powerful, as it allows you to do many operations with your data, from curation to analysis and visualization. Plus, it has an interface! So, why would you not want to use it? We already described above some of the inherent problems with interface-based tools, so we will not repeat them here. Beyond those, there is one more point to consider: for slightly more complex operations (e.g., averaging or combining columns), Excel only becomes really powerful once you use formulas. These allow you to specify a set of operations that you can apply to all the values within one column (e.g., sum all the values and divide the result by the standard deviation). If this sounds familiar, it is because we already described something like this before: this is coding! Excel has its own language with which you can create small pieces of code to be executed within each cell. So, should you use it? Although learning a bit of Excel syntax can be useful for some very simple tasks, our recommendation is not to invest your time in this and to focus on other languages instead. The main reason is that Excel syntax is mostly useful for operations applied to individual cells or columns, and always within one single file (so it will not be useful for handling files outside one concrete Excel file). This limitation means that whatever time you spend learning Excel’s syntax will not transfer to other use-cases beyond data wrangling (see following section).
If your collaborators (or you yourself) have worked extensively with Excel before, at some point you might have come across the concept of Macros. These are essentially sets of operations (scripts) that you can build into a spreadsheet in Excel so that you can apply them repeatedly to other files. Although they effectively act like scripts in the other languages that we will cover in the next sections, they have the drawback of using Excel’s unique syntax and, perhaps more importantly, they break very easily when moving between Excel versions or operating systems (e.g., Windows or macOS). This might not sound too bad, but all of the authors of this book can vouch for how frustrating it turns out to be: finding out that your Macros no longer work because you updated your software, because you changed computers or because your collaborator is opening the file on a different computer feels like an immense waste of time and is a source of huge frustration.
Key use-cases
Along your research path you will be using code in what might seem to be very different scenarios. The biggest advantage is that your skills will transfer a great deal from one scenario to another. You will learn about this in Chapter 5, but here we are going to briefly describe the key use-cases in which programming is not only useful, but essential to a successful, robust and reproducible workflow.
- Project organization and data curation. Creating folders and sub-folders, moving files, naming your variables or renaming other people’s variables are just a few examples of activities that you will be tempted to do manually. As we described above, if you do any of this via programming, you will benefit from automation, reproducibility and error-tracing.
- Experimental design and stimulus presentation. Displaying stimuli and collecting responses is a very good example of a highly repetitive task that must be reproduced consistently and in which it is really important to catch errors early on. Preparing your data collection (from design, to stimulus presentation, to data recording) via programming ticks all the important boxes in the process.
- Data Processing and Statistics. Regardless of whether you collect or generate your own data or reuse someone else’s, your data processing will involve many repetitive steps that you will want to have documented. In addition, running your analysis via programming will unlock a whole new range of tools that will greatly expand your skillset.
- Visualization. You have probably seen beautiful plots in published papers and wondered: how could I do something like this? Great news: these are most likely done with code. As soon as you embrace coding as your default way of creating visualizations, you will be able to create many different figures, for many different types of data, and all this with very little time spent redoing steps. Have you recruited new participants? Have you changed any of the filters applied to your analysis? Just execute your plotting script once again and you will see the updated version of your plot without any extra manual step.
- Writing and disseminating. Although this use-case can be a bit more niche, programming can also be used to improve your writing, by automating repetitive steps, integrating formulas or embedding figures within your manuscript. How cool would it be to click one button and automatically update a figure inside the draft of your paper?
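To give a flavour of the first use-case in the list (project organization and data curation), here is a minimal Python sketch that creates a folder structure and renames raw files consistently. The folder layout, the file names and the `sub-01` naming scheme are assumptions made for this example, not a convention prescribed by this book, and the function `organise_project` is a hypothetical helper. The example works inside a temporary folder so that it does not touch any of your real files.

```python
import tempfile
from pathlib import Path


def organise_project(root):
    """Create a folder layout and move loose CSV files into data/raw."""
    root = Path(root)
    # Create folders and sub-folders (nothing happens if they already exist).
    for sub in ("data/raw", "data/processed", "scripts", "figures"):
        (root / sub).mkdir(parents=True, exist_ok=True)

    renamed = []
    # Rename loose CSV files, in alphabetical order, with a consistent scheme.
    for number, old in enumerate(sorted(root.glob("*.csv")), start=1):
        new = root / "data" / "raw" / f"sub-{number:02d}.csv"
        old.rename(new)
        # Keep a log of every change so that it can be traced back later.
        renamed.append((old.name, new.name))
    return renamed


with tempfile.TemporaryDirectory() as tmp:
    # Simulate two inconsistently named files, as often received from others.
    for name in ("p2_final.csv", "p1_FINAL_v2.csv"):
        (Path(tmp) / name).write_text("trial,rt_ms\n1,500\n")

    log = organise_project(tmp)
    raw_files = sorted(p.name for p in (Path(tmp) / "data" / "raw").iterdir())

print(log)
print(raw_files)
```

The returned log is what provides the error-tracing mentioned above: if a file ends up with the wrong name, the list of (old name, new name) pairs shows exactly which step produced it.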