As always, I loved teaching my graduate seminar on professional and career development this past semester. So, I hope this summer to blog about some of the topics we cover in that course.
In the past few years there have been several high-profile cases of faculty members who were found to have made egregious errors in data management. In many of these cases, we are talking about blatant offenses like fabricating data, creating an entire dataset out of thin air, or purposefully throwing out participants because they did not support the desired narrative. In other cases, there is manipulation of data. I am not going to name names here, though there are news stories about many of the cases we discuss in my reading list. Note that just because an article is on my reading list doesn’t mean that I use it as an example of a blatant error – some are subtle, and some may not be errors at all.
To be clear, there can also be ethical issues at the analysis and publication stage related to plagiarism, fishing, writing up, authorship issues, peer review, etc. I am distinguishing those issues (which I hope to write about another time) from ethical issues in data management.
My suggestions about best practices are partly about making sure you and your collaborators are engaging in ethical practices. But they also help to ensure that if you are ever falsely accused of data misconduct, you can defend yourself.
How do you avoid these ethical issues in data management? The short and easy answer is, “don’t falsify or unethically manipulate your data.” But obviously there are a number of other guiding principles for engaging in best practices in ethical data management. These include (with lots of caveats):
- Choose collaborators carefully. Some of the news stories, and some stories I know personally, involve one researcher fabricating or manipulating data while the collaborators/co-authors didn’t realize it. So, think about whom you plan to collaborate with and whether you believe they are trustworthy. If someone approaches you with an idea that seems too good to be true… it just might be.
- Know your data. Part of knowing your data relates to your collaborators. In some of the stories of fabricated data, someone asked a research question, and magically, perfect data to address the research question appeared in a short period of time. It reminds me a bit of the college cheating scandal and the alleged innocence of some of the kids – specifically, the students who received high SAT scores they wouldn’t have gotten on their own (if your scores came back much better than on any of your practice exams, wouldn’t you wonder?). If a collaborator suddenly has amazing data, ask to look at it together. Don’t just trust a table of results in an MS Word document.
- Understand your data. Another aspect of knowing your data applies even when you collect your own data or work with the data directly: understand the data you work with. Run means and standard deviations on all of your variables. Make sure you don’t have any miscoded data, or miscoded missing data (those -99’s that aren’t correctly declared as missing are a classic way to mess up analyses); there is a small SPSS syntax sketch of these checks after this list. I once had a student come to me with some correlations she ran on a large sample (about 700 participants). The first thing I noticed when I looked at the output was that the n’s in the correlation table were hovering around 30. It turned out that she hadn’t recoded some variables for questions that, because of skip patterns, weren’t asked of everyone, and she hadn’t noticed before bringing the output to me. Obviously all of those analyses were useless, as they were based on less than 10% of the sample. She probably would have caught it eventually, but what if she hadn’t, and had tried to publish it? Understand your own data.
- Clean data before analysis. As I said, understanding your data is important. And often when you look at your data there are things you need to clean. One way to avoid unethical data cleaning is to clean the data, and make cleaning decisions, before you run analyses. If you wait until after, you risk the scenario where you go – hey, that didn’t turn out how I wanted it to, let me see if there are any outliers… okay, let’s just drop these three people, and, voilà!, now my hypothesis is supported. Instead, look for outliers before you do any analyses. You can then identify outliers without bias in your decision about whom to include. Similarly, you can make recoding decisions in advance. For instance, we have sometimes recoded large numbers in a frequency count – like number of lifetime sexual partners, or number of times attending religious services in a year – with a cap. But again, we do that in advance (the second sketch after this list shows what a documented cap might look like). It would be less ethical to, say, run correlations, realize the result wasn’t significant, look at the scatterplot and identify outliers, and only then decide how to cap certain participants’ values. You risk making decisions that support the findings you want to have.
- Document data cleaning decisions. When you clean or change any data, document all of those decisions. If it ever comes back to you (e.g., someone accuses you of manipulating data), you will have a clear record of any recoding you did and your reasons for doing so.
- Make data cleaning decisions public. Do not make these decisions in isolation. Try to make them with a group (your advisor, your student, others on your grant/project) so that there is open discussion and consensus on decisions. In one of my longitudinal studies, we had a sizeable number of “born again virgins” – participants who reported at one data collection point that they had had sex in their lifetime, and at a later data collection point reported that they had never had sex. We met as a group, looked carefully at the data for each participant, and made decision rules that we could apply to all participants (the third sketch after this list shows one way such cases might be flagged). We documented everything during this very public process.
- Save your syntax. When I first ran analyses in SPSS, as a full-time research assistant on a large project, we didn’t have menus in SPSS. Heck, my computer didn’t even have a mouse, so there was no way to point and click. Instead, I had to write all of the syntax myself (anyone reading this post as old as me? Remember assigning every variable name and label in the syntax? One missing comma, and it wouldn’t run?). Now, it is very easy to run analyses from menus without ever having a record of the syntax. Which seems easy, until a reviewer asks you to drop a few participants and rerun the analyses, and you cannot reproduce results that match your original ones. Tears may spill. Don’t let that happen. If you use menus, make sure that you paste all analyses into syntax files and save them. You can annotate your syntax files so that they are easy to follow – e.g., explain each analysis and why you did it (the last sketch after this list shows a small annotated example). Future-You will be very grateful to Today-You.
- Don’t analyze partial datasets. It is very tempting to “just check” how things are going when you have part of your data collected but haven’t completed data collection. There is little good that can come of such actions. What if nothing is significant? You aren’t going to stop data collection, are you? What if you find something significant, but when you finish data collection, it is no longer significant? Resist the temptation.
- Archive raw data. Make sure that you keep your raw data (and all your syntax) for a number of years after publication; some societies and journals have guidelines on how long. If you collected physical data, you will need to keep the physical files. If you collected only electronic data, it is not hard to keep everything. Even when you clean the data and change variables, make sure you keep a version of your file with the original raw data, from before any cleaning or recoding.
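To make a few of the points above more concrete, here are some minimal SPSS syntax sketches. Everything in them is invented for illustration – the variable names (rel_attend, partners, ever_sex_w1, ever_sex_w2), the -99 missing-data code, the cap at 20 – so treat them as sketches of the general idea, not as the actual decisions from any of my studies. First, the kind of “understand your data” checks described above:

```
* Declare the survey's missing-data code so -99 is not treated as a real value.
MISSING VALUES rel_attend partners (-99).

* Check means, standard deviations, ranges, and the valid N for every variable.
DESCRIPTIVES VARIABLES=rel_attend partners
  /STATISTICS=MEAN STDDEV MIN MAX.

* Frequencies catch impossible or miscoded values that summary statistics can hide.
FREQUENCIES VARIABLES=rel_attend partners.
```

If the valid n in that output hovers around 30 when you collected data from 700 people, stop and figure out why before you run anything else.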
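Next, a sketch of a documented, decided-in-advance recode. The cap of 20 is made up; the point is that the rule and its rationale are written down, and applied to everyone, before any hypothesis tests are run:

```
* Decision made at a lab meeting, before any analyses were run.
* Cap lifetime number of partners at 20 to limit the influence of extreme values.
* The same rule is applied to ALL participants.
RECODE partners (20 THRU HIGHEST = 20) (ELSE = COPY) INTO partners_capped.
VARIABLE LABELS partners_capped 'Lifetime partners, capped at 20 (see cleaning log)'.
EXECUTE.
```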
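Third, one hypothetical way to flag inconsistent longitudinal reports like the “born again virgins” example. This is not the decision rule we actually used – it only identifies the cases so the group can review them and decide together:

```
* Flag participants who reported ever having sex at wave 1 but never at wave 2.
COMPUTE inconsistent_sex = (ever_sex_w1 = 1 AND ever_sex_w2 = 0).
EXECUTE.

* Count the flagged cases, then review each one as a group before deciding anything.
FREQUENCIES VARIABLES=inconsistent_sex.
```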
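And finally, a tiny annotated analysis syntax file (hypothetical file and study names). Even if you build an analysis entirely from the menus, clicking Paste and saving the result gives you a permanent, commented record you can rerun later:

```
* Annotated analysis syntax for a hypothetical Aim 1.
* Raw, uncleaned data are archived separately; recodes are documented in the cleaning syntax.
GET FILE='study1_cleaned.sav'.

* Aim 1: association between religious attendance and (capped) lifetime partners.
CORRELATIONS
  /VARIABLES=rel_attend partners_capped
  /PRINT=TWOTAIL NOSIG
  /MISSING=PAIRWISE.
```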
Obviously, there are times you may not be able to follow every single rule here. When I went on job interviews, I was halfway through dissertation data collection. I couldn’t exactly show up to job talks with no data to report on my dissertation. So, I ran the planned analyses on the half of the dataset that I had collected. I presented these preliminary results with a lot of caveats and apologies. But in that moment, the practical outweighed the best practice. Some of these guidelines, though, hold in any scenario. I can’t ever think of a good – or at least ethical – reason not to save your raw data. Then, some day, if someone accuses you of falsifying or manipulating it, you can prove them wrong.
“Best practices in ethical data management” first appeared on Eva Lefkowitz’s blog on June 4, 2019.