@ddofer
Created August 9, 2017 12:33
Filter lab tests to those that occurred for at least K distinct users, and then filter by the most frequent (non-distinct) labs. R
#LABS
#get most frequent lab tests, for distinct patients.
#Filter for distinct by user:
# data.labs = as.data.frame(data.labs)
# data.labs.userDistinct = subset(as.data.table(data.labs),select=c("guid_tz","kod_bdika")) #ORIG
# data.labs.userDistinct= unique(data.labs.userDistinct) #ORIG
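library(data.table) # unique(..., by=) only applies to a data.table
data.labs.userDistinct = unique(as.data.table(data.labs), by=c("guid_tz","kod_bdika")) #changed: one row per (patient, test) pair, not one row per patient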
#Filter all Labs data!
# Lab tests that occurred for more than K (here 25) unique users:
commonlabNames = names(which(table(data.labs.userDistinct$kod_bdika) > 25)) # test codes (names), not their counts. Keeps supermajority of labs
data.labs = data.labs[data.labs$kod_bdika %in% commonlabNames, ] # Keep only rows for lab tests that occurred for more than K unique patients
##
# Labs which appear at least K times:
# sort(table(data.labs$kod_bdika)[table(data.labs$kod_bdika)>250],decreasing=T) ## 336 (note that we're not normalizing by occurrences per test vs per user)
# FreqlabNames = sort(table(data.labs$kod_bdika)[table(data.labs$kod_bdika)>350],decreasing=T)
FreqlabNames = names(head(sort(table(data.labs.userDistinct$kod_bdika),decreasing=TRUE),250)) # up to 250 most frequent tests, ranked by counts in data.labs.userDistinct. head() avoids NA names when fewer than 250 tests exist. Note long tail
data.labs.freq = data.labs[data.labs$kod_bdika %in% FreqlabNames, ] # Keep only rows for the most frequent lab tests
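
The two filtering steps above can be tried on a small toy table. This is a minimal base-R sketch, not the author's pipeline. The data values, the threshold `K`, the cutoff `top_n`, and every variable name except the column names `guid_tz` and `kod_bdika` are assumptions made for illustration. It uses `>= K` ("at least K"), whereas the script above uses a strict `> 25`, and it ranks the second step by raw row counts, following the description's "non-distinct" wording.

```r
# Toy data: one row per lab result (hypothetical values; column names follow the gist).
labs <- data.frame(
  guid_tz   = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  kod_bdika = c("A", "A", "B", "A", "C", "D", "A", "B", "C", "C"),
  stringsAsFactors = FALSE
)

K <- 2      # assumed threshold: minimum number of distinct patients per test
top_n <- 2  # assumed cutoff: how many of the most frequent tests to keep

# One row per (patient, test) pair, so repeated tests for a patient count once.
pairs <- unique(labs[, c("guid_tz", "kod_bdika")])

# Number of distinct patients per test code.
patients_per_test <- table(pairs$kod_bdika)

# Test codes seen in at least K distinct patients; test "D" (one patient) drops out.
common_tests <- names(patients_per_test)[patients_per_test >= K]
labs_common <- labs[labs$kod_bdika %in% common_tests, ]

# Among the remaining tests, keep the top_n codes by raw (non-distinct) row count.
row_counts <- sort(table(labs_common$kod_bdika), decreasing = TRUE)
freq_tests <- names(head(row_counts, top_n))
labs_freq <- labs_common[labs_common$kod_bdika %in% freq_tests, ]
```

Taking `names()` of the count table before using `%in%` matters: matching test codes against the counts themselves would silently select the wrong rows.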