Professional Development Blog

Ethical data management

1/30/2015

In Professional Development, we discussed violations of ethical data management, using some high-profile examples as case studies (e.g., Hauser, Stapel, Hwang). Instead of having all students read about every case, I assigned each student an article about one case (see syllabus for exact readings). I think it led to good discussion: each student presented their case to their classmates, rather than my lecturing about each case or everyone coming in with the same knowledge base. Many of the cases we discussed were glaring, with entire datasets fabricated or manipulated. But we also talked about fuzzier cases and where one might draw the line.

Then we talked about best practices in data management. The goal was twofold. First, recognizing how to avoid violating ethical data management principles. Second, even when you are doing everything ethically, making sure you do it in a way that leaves no room for anyone to suspect you of unethical practices. I should be clear that we didn’t try to tackle IRB/human subjects ethical issues this week; we focused on the data management end of things.

I suggested the following best practices:

Know your collaborators well. This point is important whether you’re talking about mentors, mentees, or collaborators on the same level. In some of the cases we discussed, there were collaborators who were getting or hearing about great data that turned out to be problematic. I’m not saying that the collaborators/students should have recognized the situation sooner; many people have been tricked in similar ways. But the more closely you work with someone, look at data together, and know the person you work with, the easier it might be to recognize problems. In many of these cases, people who had nothing to do with the fabricated or massaged data had publications that were retracted and thus disappeared from their CVs. That’s a huge deal for someone junior, and you do not want that to happen to you.

Know your own data. Before you run analyses, look at your data, your means and standard deviations, get a sense of what you have. I don’t mean start running analyses to test hypotheses at the start, but running descriptives, identifying problems with scales or measures… doing these things early can prevent problems later.
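To make "know your own data" concrete, here is a minimal sketch in Python of running descriptives before any hypothesis tests. The responses and the 1-7 scale are made-up assumptions for illustration, and the helper name describe is hypothetical.

```python
import statistics

# Hypothetical responses to one survey item, assumed to be on a 1-7 scale.
responses = [4, 5, 3, 6, 4, 70, 5, 2]


def describe(values):
    """Return basic descriptives for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


summary = describe(responses)
print(summary)

# Values outside the scale's range point to a data entry or coding problem.
out_of_range = [v for v in responses if not 1 <= v <= 7]
print("Out of range:", out_of_range)
```

A maximum of 70 on a 7-point item is the kind of problem that is much easier to deal with now than after the analyses are done.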

Clean data before analyses. No data are perfect. Data have outliers. Data have inconsistent responses. But when do you address these issues? Don’t wait until your results are not significant to poke around and look for data to eliminate. Instead, before any analyses, clean your data. Sometimes participants report they’ve had 20,000 sex partners, or that they’ve had sex 9,000 times in the past 3 months. Other times participants answer “1” to every item on a 7-point scale even though some items are reverse-scored. It’s acceptable to clean data, or even to throw out improbable data, as long as your decision rules are logical, consistent, pre-determined, and not decided after you test your hypotheses.
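One way to keep decision rules pre-determined is to write them down as code before testing anything. The sketch below is only an illustration: the variable names, the cutoff of 1,000 partners, and the example participants are all assumptions, not recommendations.

```python
# Decision rules written down BEFORE any hypothesis tests are run.
# The cutoff and item names are illustrative assumptions.
MAX_PLAUSIBLE_PARTNERS = 1000
SCALE_ITEMS = ["item1", "item2", "item3_reversed", "item4"]


def flag_case(row):
    """Return the names of the pre-determined rules this participant violates."""
    flags = []
    if row["partners"] > MAX_PLAUSIBLE_PARTNERS:
        flags.append("implausible_partner_count")
    answers = [row[item] for item in SCALE_ITEMS]
    # The same answer to every item, even though one item is reverse-scored.
    if len(set(answers)) == 1:
        flags.append("straightlining")
    return flags


# Hypothetical participants.
participants = [
    {"id": 1, "partners": 3, "item1": 5, "item2": 6, "item3_reversed": 2, "item4": 5},
    {"id": 2, "partners": 20000, "item1": 4, "item2": 5, "item3_reversed": 3, "item4": 4},
    {"id": 3, "partners": 2, "item1": 1, "item2": 1, "item3_reversed": 1, "item4": 1},
]

flagged = {p["id"]: flag_case(p) for p in participants}
print(flagged)
```

The function only flags cases for review. What to do with a flagged case (set a value to missing, drop the participant) is a separate decision that should also be made before the hypothesis tests.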

Make data cleaning decisions openly. Be open about the data cleaning process. Don’t make these decisions on your own, and then lose all of the raw data (pesky fires!). Make the decision process public and based on group consensus about how to handle these issues.

Document data cleaning decisions. Document decision rules, and any cases that were changed from the raw data. Be ready to show someone your decision rules, raw data, and cleaned data, if asked.
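As a rough sketch of what that documentation might look like, the Python below keeps the raw records untouched and logs every change made to the cleaned copy. The records, the rule text, and the helper name set_missing are hypothetical.

```python
import copy
import csv
import io

# Hypothetical raw records. They are never modified.
raw = [{"id": 1, "partners": 3}, {"id": 2, "partners": 20000}]
cleaned = copy.deepcopy(raw)
log = []


def set_missing(records, case_id, variable, rule):
    """Set one value to missing and record what was changed and why."""
    for rec in records:
        if rec["id"] == case_id:
            log.append({
                "id": case_id,
                "variable": variable,
                "old_value": rec[variable],
                "new_value": "",
                "rule": rule,
            })
            rec[variable] = None


set_missing(cleaned, 2, "partners", "partners > 1000 set to missing")

# Write the log as CSV text so it can be saved alongside the raw and cleaned data.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["id", "variable", "old_value", "new_value", "rule"]
)
writer.writeheader()
writer.writerows(log)
log_csv = buffer.getvalue()
print(log_csv)
```

With the raw data, the cleaned data, and the log side by side, anyone who asks can see exactly which cases changed and under which rule.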

Save syntax. All syntax you ever write. One of my students, Rose Wesche, wrote a whole blog post on this point recently, so you can just read what she said.

Analyzing partial datasets. This is a tough one. There are times when analyzing a partial dataset is highly useful. You want to submit for a conference but your data aren’t all in. I did my job talk on the first half of my dissertation data. If I hadn’t, I wouldn’t have had a job talk. But a risk here is analyzing partial data repeatedly until the results duck under the magical p < .05, and then ending data collection. If you really must analyze a partial dataset, be sure you know what your final N will be, and don’t deviate from it.
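To see why repeated peeking is risky, here is a small simulation sketch in Python. It assumes there is truly no effect, uses a simple z-test with a known standard deviation of 1, and compares testing once at the final N with testing at several interim points and stopping at the first p < .05. The sample sizes, number of looks, and seed are arbitrary choices for illustration.

```python
import random
from statistics import NormalDist, mean


def z_test_p(sample):
    """Two-sided p-value for H0: mean = 0, assuming a known SD of 1."""
    z = mean(sample) * len(sample) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate(n_studies=2000, final_n=100, looks=(20, 40, 60, 80, 100), seed=1):
    """Return false-positive rates for a fixed-N test and for repeated peeking."""
    rng = random.Random(seed)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_studies):
        # Data generated with no true effect, so any "significant" result is false.
        data = [rng.gauss(0, 1) for _ in range(final_n)]
        if z_test_p(data) < 0.05:
            fixed_hits += 1
        # Peeking: count the study as significant if ANY interim look crosses .05.
        if any(z_test_p(data[:n]) < 0.05 for n in looks):
            peeking_hits += 1
    return fixed_hits / n_studies, peeking_hits / n_studies


fixed_rate, peeking_rate = simulate()
print(f"Fixed N false-positive rate: {fixed_rate:.3f}")
print(f"Peeking false-positive rate: {peeking_rate:.3f}")
```

Because the final look is one of the peeks, the peeking rate can never be lower than the fixed-N rate, and in practice it comes out noticeably higher. That is the argument for settling on a final N in advance and sticking to it.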

Archive data post-publication. APA says to keep data for at least 5 years after publication. Because many of us keep publishing from a dataset for many years after collecting it, it’s important to archive data for many years.

Students generated the following additional ideas:

Write a clear methods section, so that others can replicate your methods.

Imagine your worst enemy over your shoulder. Apparently my husband shared this point last year in their methods class – nice to know they listened/retained it. That is, when making data cleaning and analysis decisions, make sure they’re justifiable.

General transparency.

Change the publish or perish culture. Students were concerned that many ethical violations occurred because of intense pressure to publish in order to succeed and/or obtain tenure. They thought a culture shift would decrease the prevalence of such incidents.

Stress management. Related to the prior point, as individuals, we might not be able to change the culture, but we could work on our own management of the pressures of academia, so that we can make wise/ethical decisions.

What did we miss? What’s important to teach students about being future scientists/researchers?

“The post Ethical issues in data management first appeared on Eva Lefkowitz’s blog on January 30, 2015.”


Eva S. Lefkowitz
