So there was a discussion started by Michael Eisen under the #SuccessWithoutGlamour hashtag about the necessity of publishing in the Big Three journals (CNS: Cell, Nature, and Science). The responses were heartening, but I suspected there might be some selection bias and decided to take a marginally more quantitative look.
I thought this poster by Priem and colleagues was interesting and took it as a starting point (Priem, Costello, and Dzuba 2012). I'm rather pressed for time, so rather than try their manual approach to surveying faculty, I narrowed the question considerably. I didn't want to spend a bunch of time parsing HTML either, so I focused on the current NSF biology postdoctoral fellows. I exported a CSV from the NSF's award search containing 88 active or expired fellowships awarded after 21 June 2010 (around the time of PLoS One's first impact factor).
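import csv, os, sys
from Bio import Entrez

# NCBI asks for a contact address with every Entrez request
Entrez.email = "firstname.lastname@example.org"

# Collect the fellows' names from the NSF export
nsf = csv.DictReader(open("Awards.csv", 'rt'))
names = []
for row in nsf:
    names.append(row['PrincipalInvestigator'])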
As shown above, I used Biopython's Entrez interface to PubMed (Cock et al. 2009). I manually located the NLM Unique IDs for the CNS journals (Nature:410462, Science:0404511, Cell:0413066), and use those in the search:
glamour_NLM_ID = ['410462','0404511','0413066']
annointed_ones = []  # names with at least one CNS hit
no_xml = []          # names whose response could not be parsed as XML
Next, I looped over each name, searching PubMed, then extracting any hits, and skipping over any names which lacked them:
# NotXMLError is defined in Bio.Entrez.Parser and must be imported explicitly
from Bio.Entrez.Parser import NotXMLError

for name in names:
    # Search PubMed for all hits with our author's name
    try:
        search = Entrez.esearch(db="pubmed", term=''.join([name, '[AUTH]']))
        record = Entrez.read(search)
        # Get the records
        if len(record["IdList"]) > 0:
            hits = Entrez.efetch(db="pubmed", id=record["IdList"], retmode="xml")
            hitlist = Entrez.read(hits)
            hits.close()
            # Make a list of NLM UIDs
            nlmIDs = [hit['MedlineCitation']['MedlineJournalInfo']['NlmUniqueID'] for hit in hitlist]
            if len([val for val in nlmIDs if val in glamour_NLM_ID]) > 0:
                # Wow, congrats!
                journal_titles = [hit['MedlineCitation']['Article']['Journal']['Title'] for hit in hitlist]
                annointed_ones.append(name)
            else:
                continue
        else:
            pass
    except NotXMLError:
        no_xml.append(name)
Code and data available here.
>>> print(len(no_xml))
0
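To check the journal-matching step without hitting PubMed, the filter can be pulled out into a small function. This is only a sketch: `glamour_hits` is a hypothetical helper, not part of the original script, and the sample records are made up to mirror the nested keys used in the loop above.

```python
# Hypothetical helper; the record layout mirrors the keys used in the loop above.
GLAMOUR_NLM_IDS = {'410462', '0404511', '0413066'}  # IDs as listed in this post


def glamour_hits(records, glamour_ids=GLAMOUR_NLM_IDS):
    """Return journal titles of records whose NLM Unique ID is in glamour_ids."""
    titles = []
    for rec in records:
        citation = rec['MedlineCitation']
        if citation['MedlineJournalInfo']['NlmUniqueID'] in glamour_ids:
            titles.append(citation['Article']['Journal']['Title'])
    return titles


# Made-up records for illustration only (not real PubMed output).
fake_records = [
    {'MedlineCitation': {'MedlineJournalInfo': {'NlmUniqueID': '0404511'},
                         'Article': {'Journal': {'Title': 'Science'}}}},
    {'MedlineCitation': {'MedlineJournalInfo': {'NlmUniqueID': '9999999'},
                         'Article': {'Journal': {'Title': 'Some Other Journal'}}}},
]

print(glamour_hits(fake_records))  # prints ['Science']
```

Returning only the matching titles, rather than every title for the author, would also make it easy to see which journal produced each hit.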
I collected any names with hits in the annointed_ones list, which we can compare with all the names:
>>> pct = float(len(annointed_ones))/len(names) * 100
>>> pct
5.681818181818182
So we have five postdoctoral fellows with CNS publications out of 88, or about 5.7% of the total.
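With only five hits out of 88, the percentage is fairly uncertain. One rough way to see how uncertain is a Wilson score interval for a binomial proportion; this is my own addition rather than part of the original analysis, and `wilson_interval` is a hypothetical helper name. It also assumes the fellows can be treated as independent draws, which is a simplification.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / float(n)
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4.0 * n ** 2)) / denom
    return centre - half, centre + half


# Counts taken from the result above: 5 fellows with CNS papers out of 88.
low, high = wilson_interval(5, 88)
print("%.1f%% (approx. 95%% interval: %.1f%% to %.1f%%)"
      % (100 * 5 / 88.0, 100 * low, 100 * high))
```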
This is a mere proof of concept. Obvious follow-ups:
- What about OA publications?
- The author search needs to be more specific, in particular to account for common surnames, since a bare name match can pick up papers by other people.
- Faculty, of course, are a different animal entirely. You'd probably want to look at tenure-track faculty, but I can't think of a clean and automated way to do so.
And others which escape me at the moment. I may return to this in the future, at the very least to rerun this with OA journals included, or to get more detailed information on the publications in which the fellows did publish.
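As a starting point for the common-surname problem, the search term could be narrowed with extra PubMed field tags. The sketch below only builds the query string, so it runs offline. `author_query` is a hypothetical helper; `[PDAT]` (publication date) and `[AD]` (affiliation) are standard PubMed field tags, but whether a date or affiliation filter is appropriate depends on the question being asked, and I have not tested these queries against the fellows' names.

```python
def author_query(name, start_date=None, affiliation=None):
    """Build a PubMed search term that is narrower than a bare author search."""
    parts = ['"%s"[AUTH]' % name]
    if start_date:
        # "3000" is a conventional open-ended upper bound for PubMed date ranges.
        parts.append('("%s"[PDAT] : "3000"[PDAT])' % start_date)
    if affiliation:
        parts.append('"%s"[AD]' % affiliation)
    return ' AND '.join(parts)


# "Doe J" is a placeholder name, not one of the fellows.
print(author_query("Doe J", start_date="2010/06/21"))
# prints "Doe J"[AUTH] AND ("2010/06/21"[PDAT] : "3000"[PDAT])
```

The resulting string could be passed as the `term` argument to `Entrez.esearch` in place of the plain `name + '[AUTH]'` term.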
Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, et al. 2009. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics [Internet] 25:1422–1423. Available from: http://bioinformatics.oxfordjournals.org/content/25/11/1422
Priem J, Costello K, Dzuba T. 2012. Prevalence and use of Twitter among scholars. FigShare.