Showing posts from 2017

Macro scheme for generation of RFM-I data

Recency, frequency, monetary value and interaction (RFM-I) are the guiding principles for extracting customer data when building segments and prediction models. Managing and processing RFM-I data is often more time-consuming than the analysis itself.
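As a rough illustration of the recency, frequency and monetary components, here is a minimal Python sketch (not the SAS macro from the post). The transaction records, the customer ids and the `as_of` reference date are all hypothetical, and the interaction component is left out because it depends on which customer contacts are recorded.

```python
from datetime import date

# Hypothetical transaction records: (customer_id, purchase_date, amount).
transactions = [
    ("c1", date(2017, 1, 5), 120.0),
    ("c1", date(2017, 3, 20), 80.0),
    ("c2", date(2016, 11, 2), 40.0),
]


def rfm(rows, as_of):
    """Return {customer: (recency_in_days, frequency, monetary_total)}."""
    summary = {}
    for customer, day, amount in rows:
        # Start from the current purchase date so max() works for new customers.
        last, count, total = summary.get(customer, (day, 0, 0.0))
        summary[customer] = (max(last, day), count + 1, total + amount)
    return {
        customer: ((as_of - last).days, count, total)
        for customer, (last, count, total) in summary.items()
    }


rfm_table = rfm(transactions, as_of=date(2017, 4, 1))
for customer, values in sorted(rfm_table.items()):
    print(customer, values)
```

The single pass over the records mirrors the record-by-record scan described here; in practice the rows would come from presorted tables, not an in-memory list.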

We often see various kinds of tables gathered from different relational databases and a need to scan those tables record by record. SAS is an appropriate choice even for huge data tables because of the principles governing the implementation of the SAS engine.

Below is a pseudo-code macro extracted from an existing code base that combines several data sources. The programming language is SAS, since SAS offers great transparency and robustness. I considered proc sql, hash look-up tables, and combinations of data steps with proc means, implemented those alternatives, and concluded that dividing the data into presorted historic data tables and present data tables leads to both transparency and speed-up of execution of cod…

Extracting variables and estimates from SAS prediction models

SAS ods output statements provide a simple alternative to advanced text manipulation techniques.

The example below extracts selected variables from proc hpgenselect and uses them in proc corr and proc gampl, the latter of which outputs predicted probabilities. A final ods output statement in proc logistic extracts the concordance index, i.e. the measure of the second-order predictive capability of the model, given as the probability that two different observations are correctly ordered with respect to the model-based risk score.
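To make the definition of the concordance index concrete, here is a small Python sketch. It is not the SAS output itself: it assumes binary outcomes (1 = event, 0 = non-event), compares every event/non-event pair, and counts tied scores as one half, which is the usual convention for the c statistic. The example data are made up.

```python
from itertools import combinations


def concordance_index(outcomes, scores):
    """Share of event/non-event pairs in which the event has the higher score.

    Pairs with tied scores count one half.
    """
    concordant = ties = pairs = 0
    for (y1, s1), (y2, s2) in combinations(zip(outcomes, scores), 2):
        if y1 == y2:
            continue  # only pairs with different outcomes are informative
        pairs += 1
        event_score, other_score = (s1, s2) if y1 > y2 else (s2, s1)
        if event_score > other_score:
            concordant += 1
        elif event_score == other_score:
            ties += 1
    if pairs == 0:
        raise ValueError("need at least one event and one non-event")
    return (concordant + 0.5 * ties) / pairs


# Made-up outcomes and predicted probabilities.
c_index = concordance_index([1, 0, 1, 0], [0.9, 0.3, 0.4, 0.6])
print(c_index)
```

In this toy data three of the four event/non-event pairs are ordered correctly, so the index is 0.75.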

Extracting selected variables from the hpgenselect procedure using code

Several SAS procedures generate code from which the user may extract critical information.

In the example below I extract the variables selected by proc hpgenselect and pass them to proc corr and proc gampl without writing intermediate results to the hard drive:

The generated source code 'code' is read one line at a time. Appropriate text lines are kept and concatenated into a single string using retained variables for identification (expr) and text (text).
The variable names are stored in a macro variable 'variables' and in a data file test in the work library. The list may be inspected either in the data file work.test or in the SAS log.

Modeling gender and age adjusted incidence rates

The National Cancer Institute (NCI) provides a toolbox for calculating cancer incidence and percentage change. Their algorithm for Joinpoint Trend Analysis is well documented, but it is not the best tool at hand for most problems. The normal approximation is a poor choice when the incidence rate is low; in that situation I recommend modern logistic regression algorithms, which are far more versatile. In the logistic regression model, either direct or indirect incidence rates are modeled using population numbers by age, year, and gender, together with the number of cases specific to the given year, gender, and age group. If the registry is complete there are no missing data, whereas the normal model cannot use data points for year/age-gender groups with no events and must resort to simulation to get approximate estimates.

The difference between careful parametrization in a binomial regression model and the plug-and-play functionality of the NCI suite becom…
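The point about groups with no events can be illustrated with a short Python sketch of a binomial log-likelihood. This is a simplification, not the regression model from the post: it assumes one common incidence probability `p` for all strata, and the case and population counts are invented. A stratum with zero cases still contributes through its population term, so it is not a missing data point.

```python
import math

# Hypothetical strata: (number of cases, population at risk).
strata = [(0, 50_000), (3, 120_000), (1, 80_000)]


def binomial_loglik(p, data):
    """Binomial log-likelihood up to an additive constant.

    The binomial coefficient is omitted because it does not depend on p.
    """
    return sum(c * math.log(p) + (n - c) * math.log(1.0 - p) for c, n in data)


# With one common p, the maximum likelihood estimate is total cases / total population.
p_hat = sum(c for c, _ in strata) / sum(n for _, n in strata)

# Contribution of the stratum without events: population * log(1 - p).
zero_event_term = binomial_loglik(p_hat, [strata[0]])

print(p_hat, zero_event_term)
```

A full model would replace the common `p` with a logistic function of age, year and gender, but the likelihood contribution of each stratum keeps this form.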

Dummy variables in SAS

It is difficult to create indicator or dummy variables from factor variables in the SAS programming language.

The macro below assumes an integer factor variable (factor levels with values 0, 1, 2, 3, ...) and no missing values or formats. Here is a way to ensure this simple assumption is fulfilled. Add a few lines to an appropriate data step:

if missing(factor) then factor=0;
format factor;

To convert a non-numeric factor variable to an integer factor variable, create a format and construct an integer factor variable:

proc format library=WORK;
  value $factor
    'firstlevel'='1'
    'secondlevel'='2'
    ...
    'finallevel'='n'
  ;
run;

data newdataset;
  set olddataset;
  /* the informat needs a trailing period: 8. */
  newfactor=input(put(factor,$factor.),8.);
run;

You are now able to define and run the macro.

/* Macro to generate dummy variables. */
%MACRO GETCON;
  %DO I = 1 %TO &N;
    %IF &&M&I = 0 %THEN %GOTO OUT;
    IF &factor = &&M&I THEN &factor&I = 1;
    ELS…
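The idea behind the macro, one indicator column per non-zero integer level, can be sketched in Python as follows. This is not a translation of the truncated macro. It assumes that level 0 is the reference level without its own column, that missing values are recoded to 0 as described above, and that columns are named after the level value; the `prefix` argument is a hypothetical name.

```python
def make_dummies(levels, prefix="factor"):
    """Return one dict of 0/1 indicators per observation.

    Level 0 is the reference level and missing values (None) are recoded to 0.
    """
    recoded = [0 if value is None else value for value in levels]
    distinct = sorted({value for value in recoded if value != 0})
    return [
        {f"{prefix}{level}": int(value == level) for level in distinct}
        for value in recoded
    ]


dummies = make_dummies([0, 1, 2, 1, None])
for row in dummies:
    print(row)
```

Observations at the reference level get zeros in every indicator column, which is what a regression procedure expects from a dummy coding.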

A multigroup comparison of IRR using a modified Fleiss' Kappa statistic

Inter-rater reliability (IRR) among more than two raters may be calculated using a modified Fleiss' Kappa. I suggest a modified version of the procedures for the modified Fleiss' Kappa that enables the surveyor to compare different groups of respondents across a range of evaluators.

IRR for nominal data
IRR for ordinal data using a linear weight matrix
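For orientation, the standard (unmodified and unweighted) Fleiss' Kappa for nominal data can be sketched in Python as below. This is the textbook statistic, not the modified version proposed in the post. It assumes every subject is rated by the same number of raters, and the count matrix is invented.

```python
def fleiss_kappa(counts):
    """Standard Fleiss' Kappa.

    counts[i][j] is the number of raters who assigned subject i to category j.
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every subject must have the same number of ratings")
    n_categories = len(counts[0])
    total = n_subjects * n_raters
    # Overall proportion of ratings that fall in each category.
    p_category = [sum(row[j] for row in counts) / total for j in range(n_categories)]
    # Observed agreement among rater pairs for each subject.
    p_subject = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_subject) / n_subjects
    p_expected = sum(p * p for p in p_category)
    return (p_observed - p_expected) / (1.0 - p_expected)


# Made-up ratings: four subjects, three raters, two categories.
kappa = fleiss_kappa([[3, 0], [0, 3], [2, 1], [1, 2]])
print(kappa)
```

A group comparison would compute such a statistic separately for each group of respondents; how the post modifies the statistic for that purpose is not shown in the excerpt.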

NORDCAN - incidences and numbers

Below are components of an R script that extracts age- and gender-specific rates and proportions from Nordic databases, divided by year and geography.

Link to NORDCAN website: >click here<



#Table mapping ICD codes and cancer types

# ICD-10 Label
# C00-14\C10.1 Lip, oral cavity and pharynx
# C00.0-2,C00.5-9 Lip
# C00.3-4,C02-04,C05.0,C06 Oral cavity
# C01,C05.1-9,C09,C10.0,C10.2-9 Oropharynx
# C07-08 Salivary glands
# C11 Nasopharynx
# C12-13 Hypopharynx
# C14 Pharynx, ill-defined
# C15 Oesophagus
# C16 Stomach
# C17 Small intestine
# C18 Colon
# C18-21 Colorectal
# C19-21 Rectum and anus
# C22 Liver
# C23-24 Gallbladder
# C25 Pancreas
# C26,C39,C76-80, C97,D47 Unknown and ill-defined
# C30-31 Nose, sinuses
# C32+C10.1 Larynx
# C33-34 Lung
# C37,C38.0-3,C38.8,C45.1-9,C46.2-9,C47-48,C74,C75.0,C75.4-9,C88,D46 Other specified cancers
# C38.4+C45.0 Pleura
# C40-41 Bone
# C43 Melanoma of skin
# C44+C46.0 Skin, non-melanoma
# C49+C46.1 Soft tissues
# C…

Creating dummy variables in R

Randy Zwitch has a blog entry on creation of dummy variables from factor levels.

example <- as.data.frame(c("A","A","B","F","C","G","C","D","E","F"))
names(example) <- "strcol"

#For every unique value in the string column, create a new 1/0 column
#This is what Factors do "under-the-hood" automatically when passed to function requiring numeric data
for(level in unique(example$strcol)){
  example[paste("dummy",level,sep="_")] <- ifelse(example$strcol==level,1,0)
}

Often you encounter special characters, in which case you can use gsub and regular expressions:

example <- as.data.frame(c("AÆ","AÆ","B","FÅ","C","G","C","D",
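The clean-up of special characters can be sketched in Python with a regular expression, playing the role that gsub plays in R. The replacement rule used here (every character outside A-Z, a-z, 0-9 and underscore becomes an underscore) is an assumption made for illustration, not necessarily the pattern used in the original post.

```python
import re


def safe_column_name(level, prefix="dummy"):
    """Build a column name in which non-ASCII or special characters become underscores."""
    cleaned = re.sub(r"[^0-9A-Za-z_]", "_", level)
    return f"{prefix}_{cleaned}"


# Levels taken from the visible part of the example above.
levels = ["AÆ", "AÆ", "B", "FÅ", "C", "G", "C", "D"]
column_names = sorted({safe_column_name(level) for level in levels})
print(column_names)
```

Note that two different levels can collapse to the same cleaned name (for example two levels that differ only in a special character), so a real script should check for collisions.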