

SAS to Python – Functional vs Line by Line Translations

Data frame structures and methods eliminate the circuitous logic and extraneous data structures required in older procedural languages

Photo: Unsplash.com

After recently refactoring code for calculating hospital ratings using public data from the Center for Medicare/Medicaid Services (original code and data can be found at https://qualitynet.cms.gov/inpatient/public-reporting/overall-ratings/sas), I thought I would capture my key insights here.

One observation from this conversion was that the power of data frames has completely changed the standard for readability and speed of delivery. SAS evolved out of the early procedural world of FORTRAN and COBOL, where data lives in arrays of uniform type that are looped over element by element to perform arithmetic or comparisons, often with parallel arrays to carry intermediate results for later processing. Data frames raise the level of abstraction: a single heterogeneous table replaces those parallel arrays, and operations apply across whole columns at once.

Here’s an example. In the following SAS code, the objective is to identify rating categories that have 100 or fewer observations and drop those columns from the dataset:

PROC SQL;
select measure_in_name into: measure_in separated by '' notrim
from include_measure0;
QUIT;
%put &measure_in;
/* &measure_cnt: number of included measure */
PROC SQL;
select count(measure_in_name)
into: measure_cnt
from include_measure0;
QUIT;
%put &measure_cnt;
/*COUNT # HOSPITALS PER MEASURE FOR ALL MEASURES*/
PROC TABULATE DATA=All_data_&year.&quarter out=measure_volume; 
var &measure_all;
table n*(&measure_all);
RUN;
PROC TRANSPOSE data=Measure_volume out=measure_volume_t;
RUN;
/* IDENTIFY MEASURES WITH VOLUMN <=100 */
DATA less100_measure  (drop=_NAME_ _LABEL_ rename = (COL1=freq)); 
SET measure_volume_t (where = (col1<=100));
if _name_ ^= '_PAGE_'  and _name_^='_TABLE_';
measure_name = tranwrd(_NAME_, '_N', '');
RUN;
DATA R.less100_measure;SET less100_measure;run;*OP-2;
/* CREATE a measure list for count<=100 */
PROC SQL;
select measure_Name
into: measure_exclude separated by '' notrim
from Less100_measure;
QUIT;
/* REMOVE MEASURES WHICH HAVE HOSPITAL COUNTS <=100*/
DATA initial_data_&year.&quarter;
SET All_data_&year.&quarter;
/* measure volume <=100*/
drop &measure_exclude ;
RUN;

The code starts out creating a structure to hold the column names, then a structure it populates with the observation counts for each named column. Next, it identifies which columns have 100 or fewer observations and creates a structure to hold those column names, and finally, using those names, it drops them from the main data array.

Here is the equivalent python using a pandas dataframe of the same initial data:

# keep only columns with hospital counts > 100
dfObs100 = df[df.columns.intersection(df.columns[df.notna().sum() > 100])]

There are no explicit data structures created or element-wise calculation loops executed. Everything happens inside the pandas functions, which means you get the benefit of state-of-the-art internal methods, written by the package developers, for looping over columns and rows.
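To make the mechanics concrete, here is a minimal sketch on made-up data; the measure names are hypothetical, not taken from the CMS dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three measures, one of which is mostly missing
df = pd.DataFrame({
    "MORT_30_AMI": np.random.randn(150),
    "MORT_30_HF": np.random.randn(150),
    "RARE_MEASURE": [np.nan] * 140 + list(np.random.randn(10)),
})

# Keep only measures observed at more than 100 hospitals;
# notna().sum() counts non-missing values per column in one vectorized pass
dfObs100 = df[df.columns.intersection(df.columns[df.notna().sum() > 100])]

print(list(dfObs100.columns))  # RARE_MEASURE is dropped
```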

Another example is the code for Program 1 in the package, which consists of a SAS program file plus a macro from another file that calculates the standardized group scores. Here is the SAS code for Program 1:

******************************************
* Outcomes - Mortality       *
******************************************;
*option mprint;
/* count number of measures in Outcome Mortality Group */
PROC SQL;
select count(measure_in_name)
into: measure_OM_cnt /*number of measures in this domain*/
from Outcomes_mortality;/*Outcomes_mortality is generated from the SAS program '0 - Data and Measure Standardization_2021Apr'*/
QUIT;
/*OM is used to define the data name for mortality Group; 
&measure_OM is the measures in Mortality Group;
&measure_OM_cnt is the number of measures in this Group;*/
/*output group score in R.Outcome_mortality*/
%grp_score(&MEASURE_ANALYSIS, OM, &measure_OM, &measure_OM_cnt,R.Outcome_mortality);
***************************************
* Outcomes - Safety of Care     *
***************************************;
/* count number of measures in Outcome Safety Group */
PROC SQL;
select count(measure_in_name)
into: measure_OS_cnt
from Outcomes_safety;/*Outcomes_safety is generated from the SAS program '0 - Data and Measure Standardization_2021Apr'*/
QUIT;
/*OS is used to define the data name for Safety Group; 
&measure_OS is the measures in Safety Group;
&measure_OS_cnt is the number of measures in Safety Group;*/
/*output group score in R.Outcome_safety */
%grp_score(&MEASURE_ANALYSIS, OS, &measure_OS,  &measure_OS_cnt, R.Outcome_safety);
********************************************
* Outcomes - Readmission        *
********************************************;
/* count number of measures in Outcome Readmission Group */
PROC SQL;
select count(measure_in_name)
into: measure_OR_cnt
from Outcomes_readmission;/*Outcomes_readmission is generated from the SAS program '0 - Data and Measure Standardization_2021Apr'*/
QUIT;
/*OR is used to define the data name for Readmission Group; 
&measure_OR is the measures in Readmission Group;
&measure_OR_cnt is the number of measures in Readmission Group;*/
/*output group score in R.Outcome_readmission*/
%grp_score(&MEASURE_ANALYSIS, OR, &measure_OR, &measure_OR_cnt, R.Outcome_readmission);
******************************************
*  Patient Experience        *
******************************************;
/* count number of measures in Patient Experience Group */
PROC SQL;
select count(measure_in_name)
into: measure_PtExp_cnt
from Ptexp;/*Ptexp is generated from the SAS program '0 - Data and Measure Standardization_2021Apr'*/
QUIT;
/*PtExp is used to define the data name for Patient Experience Group; 
&measure_PtExp is the measures in Patient Experience Group;
&measure_PtExp_cnt is the number of measures in Patient Experience Group;*/
/*output group score in R.PtExp*/
%grp_score(&MEASURE_ANALYSIS, PtExp, &measure_PtExp,  &measure_PtExp_cnt,R.PtExp);
**********************************************
* Timely and Effective Care                  *
**********************************************;
/* count number of measures in Timely and Effective Care */
PROC SQL;
select count(measure_in_name)
into: measure_Process_cnt
from Process;/*Process is generated from the SAS program '0 - Data and Measure Standardization_2021Apr'*/
QUIT;

Here is the accompanying macro used in Program 1:

**********************************************************
* macro for calcuating group score for each measure group*
**********************************************************;
%macro grp_score(indsn, gp, varlist,  nmeasure, Out_avg);
  data dat0 (keep=provider_id &varlist.  c1-c&nmeasure. total_cnt measure_wt avg ); 
  set &indsn.;
array M(1:&nmeasure.) &varlist.;
  array C (1:&nmeasure.) C1-C&nmeasure.;
DO k =1 TO &nmeasure.;
  if m[k] ^=. then C[k]=1;
     else C[k]=0;
  END;
  total_cnt=sum(of c1-c&nmeasure.);

  if total_cnt>0 then do;
  measure_wt=1/total_cnt;
  avg=sum(of &varlist.)*measure_wt;
  end;
  run;

  *standardization of group score;
  PROC STANDARD data=dat0 mean=0 std=1 out=dat1;var avg;run;
*add mean and stddev into the data;
  ods output summary=new(drop=variable);
  proc means data=dat0 stackodsoutput mean std ;
   var avg;
  run;
proc sql; 
    create table dat2 as
    select  *
    from dat0, new;
  quit;
data &out_avg;merge dat2(rename=avg=score_before_std) dat1(keep=provider_ID avg rename=avg=grp_score);by provider_ID;run;
%mend;

Here is my equivalent Program 1 code in python:

# mortality scores
mortality=getScores(dfAllStd,df,mortalityFields)
# safety scores
safety=getScores(dfAllStd,df,safetyFields)
# readmission scores
readmit=getScores(dfAllStd,df,readmitFields)
# patient care scores
patient=getScores(dfAllStd,df,pxFields)
# process scores
process=getScores(dfAllStd,df,processFields)

with a supporting python function, getScores(), roughly the equivalent of the SAS macro:

def getScores(dfStd, dfAll, fields):
    # get field counts and weights
    cnt = dfStd[dfStd.columns.intersection(fields)].count(axis=1, numeric_only=True)
    wt = (1/cnt).replace(np.inf, np.nan)
    # get raw scores
    scoreB4 = dfStd[dfStd.columns.intersection(fields)].mean(axis=1)
    # standardize scores
    sMean = np.nanmean(scoreB4)
    sStddev = np.nanstd(scoreB4)
    score = (scoreB4 - sMean)/sStddev
    # generate table of values
    table5 = pd.DataFrame({'id': dfAll['PROVIDER_ID'], 'count': cnt,
                           'measure_wt': wt, 'score_before': scoreB4, 'score_std': score})
    return table5

As you can see again, the python code is smaller and easier to read, with the loops happening implicitly inside pandas.
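As a rough illustration of what getScores does internally, here is its core on toy data; the provider IDs, measure names, and values are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy standardized measure data for three hospitals (hypothetical values)
dfStd = pd.DataFrame({
    "PROVIDER_ID": ["A", "B", "C"],
    "M1": [1.0, -1.0, 0.0],
    "M2": [0.5, np.nan, -0.5],
})
fields = ["M1", "M2"]

# Row-wise mean over the group's measures; pandas skips NaN by default,
# which plays the role of the SAS count/weight bookkeeping
scoreB4 = dfStd[fields].mean(axis=1)

# Standardize across hospitals, mirroring PROC STANDARD mean=0 std=1
score = (scoreB4 - np.nanmean(scoreB4)) / np.nanstd(scoreB4)
```

After standardization the group scores have mean 0 and standard deviation 1 across hospitals, which is exactly what the macro's PROC STANDARD step produces.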

Similar compact results can be achieved using dataframes in R. In fact, implicit loops, via vectorized operations and the "apply" family of functions, are critical for R computing performance.

Another lesson learned was that since an automated refactoring tool would just recreate the line-by-line structure of the original SAS code, a typical translator would simply duplicate the old structure along with its original design issues. This might make sense for operations where the data structure of the input and output is unlikely to change. But as the availability of data grows, inputting more parallel fields will be increasingly desired, and new models with new outputs will become more commonplace, so the code will very likely need to adapt to outputs derived from new modelling techniques.

If the intention of language transformation is to modernize and adapt the code to new data channels and modelling capabilities, automated code refactoring would be an obstacle. So before you begin refactoring old data science code, identify which parts of the code you want to leverage, plan for the flexible capabilities available in the target language, and then refactor from functional specifications with the old code as reference.

In summary, the key learnings were:

  • Line-by-line conversion locks in old program structure which is likely suboptimal versus new language capabilities
  • Dataframes coupled with functions that operate on them create better, more readable code that is easier to maintain
  • Simple translators that do line by line translation might enable access to new methods and models but lock critical program areas into an inferior data and logic paradigm
  • Using embedded functionals in pandas and other packages provides state-of-the-art performance for loops over column and row operations
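The last point can be illustrated with a quick sketch: an explicit element-wise loop, in the style of a procedural translation, and the vectorized pandas equivalent compute the same result, but the latter delegates the iteration to optimized internals.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(100_000, dtype=float)})

# Explicit element-wise loop, as a line-by-line translation would write it
total_loop = 0.0
for v in df["x"]:
    total_loop += v

# Vectorized equivalent: the loop runs inside pandas/NumPy compiled code
total_vec = df["x"].sum()

assert total_loop == total_vec
```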

Finally, the same issue exists for translating FORTRAN and COBOL, which use similar data structures and loops. The real power of modern languages lies in dataframes and functionals, and unleashing it requires non-linear restructuring beyond simple machine translation.

