Skip to main content

Creating dummy variables in R

Randy Zwitch has a blog entry on creation of dummy variables from factor levels.

example <- span=""> as.data.frame(c("A", "A", "B", "F", "C", "G", "C", "D", "E", "F"))
names(example) <- span=""> "strcol"

#For every unique value in the string column, create a new 1/0 column
#This is what Factors do "under-the-hood" automatically when passed to function requiring numeric data
for(level in unique(example$strcol)){
  example[paste("dummy", level, sep = "_")] <- span=""> ifelse(example$strcol == level, 1, 0)
}
view raw
Often you encounter special characters in which case you can use gsub and regular expressions
example <- span=""> as.data.frame(c("AÆ", "AÆ", "B", "FÅ", "C", "G", "C", "D", "E", "FÅ"))
names(example) <- span=""> "strcol"

#For every unique value in the string column, create a new 1/0 column
#This is what Factors do "under-the-hood" automatically when passed to function requiring numeric data
for(level in unique(example$strcol)){
  example[gsub('[^a-zA-Z0-9_],"",paste("dummy", level, sep = "_"),fixed=FALSE)] <- span=""> ifelse(example$strcol == level, 1, 0)
} 
You may also use levels instead of unique in conjunction with subsetting, e.g. levels(example$strcol)[-1] to create dummy variables mapping reference level to baseline/intercept in your regression model.
Equation formulas are generated with the paste function:
paste("somevar ~",paste(names(dataframe),sep="",collapse="+"))

Comments

Popular posts from this blog

HackRF on Windows 8

This technical note is based on  an extract from thread . I have made several changes and added recommendations. I have experienced lot of latency using GnuRadio and HackRF on Pentoo Linux, so I wanted to try out GnuRadio on Windows. HackRF One is a transceiver, so besides SDR capabilities, it can also transmit signals, inkluding sweeping a given range, uniform and Gaussian signals. Pentoo Linux provides the most direct access to HackRF and toolboxes. Install Pentoo Linux on a separate drive, then you can use osmocom_siggen from a terminal to transmit signals such as near-field GSM bursts, which will only be detectable within a meter. Installation of MGWin and cmake: Download and install the following packages: - MinGW Setup (Go to the Installer directory and download setup file) - CMake (I am using CMake 3.2.2 and I installed it in C:\CMake, this path is important in the commands we must send in the MinGW shell) Download and extract the packages

Alder/korrekt århundrede udfra cpr nummer

De fleste, der arbejder med registre eller databaser, står ofte med problemstillingen, at alder er uoplyst, medens cpr-nummer er kendt. Hvordan regner man den ud? Følgende regel er gældende: Hvis syvende ciffer er 0, 1, 2 eller 3 er man født i det 20. århunderede (1900-tallet) Ligeledes, hvis syvende ciffer er 4 eller 9, og årstallet (femte og sjette ciffer) er større end eller lig 37. Endelig er man født i det 19. århundrede (1800-tallet) hvis syvende ciffer er 5, 6, 7 eller 8 og årstallet er større end eller lig 58. Nedenfor finder du eksempel i SAS kode: En lille makro, der udover fødselsdato også udregner køn samt den præcise alder givet datovariabel. Kilde: Opbygning af CPR nummeret, cpr.dk proc format library=work; value gender 0="Female" 1="Male" ; run; %macro agefromCPR(cpr,datevar=inddto,birthvar=birth,agevar=age); dy_temp=input(substrn(&cpr,1,2),2.); mt_temp=input(substrn(&cpr,3,2),2.); yr_temp=input(substrn(&cpr,5,2),

Comorbidity indexes in SQL

Generating Elixhauser comorbidity index from Danish National Health Register as relational database. ( ICD 10 Coding  in SAS) A lookup-table based version of Charlson comorbidity index I made in SQL. A similar approach can be applied to Elixhauser. SELECT V_CPR, MAX(EI1)+MAX(EI2)+MAX(EI3)+MAX(EI4)+MAX(EI5)+ MAX(EI6)+MAX(EI7)+MAX(EI8)+MAX(EI9)+MAX(EI10)+ MAX(EI11)+MAX(EI12)+MAX(EI13)+MAX(EI14)+MAX(EI15)+ MAX(EI16)+MAX(EI17)+MAX(EI18)+MAX(EI19)+MAX(EI20)+ MAX(EI21)+MAX(EI22)+MAX(EI23)+MAX(EI24)+MAX(EI25)+ MAX(EI26)+MAX(EI27)+MAX(EI28)+MAX(EI29)+MAX(EI30)+MAX(EI31) AS Elixhauser FROM (SELECT V_CPR, -- Congestive Heart Failure CASE WHEN DIAG LIKE 'DI099%' OR DIAG LIKE 'DI110%' OR DIAG LIKE 'DI130%' OR DIAG LIKE 'DI132%' OR DIAG LIKE 'DI255%' OR DIAG LIKE 'DI420%' OR DIAG LIKE 'DI425%' OR DIAG LIKE 'DI426%' OR DIAG LIKE 'DI427%' OR DIAG LIKE 'DI428%' OR DIAG LIKE 'DI429%' OR D