by Ginna Gomez with Greg Page
We built the data set for this exercise by simulating a survey that asked respondents to rank 24 separate monthly cell phone plan bundles, with 1 being the best and 24 being the worst. After the survey was presented to 1000 respondents, each bundle received a single overall ranking, based on its average ranking across all respondents.
The plans include two options for Internet data, three options for call minutes, two options for music, and two options for social network access. The features are based on real phone plan offerings from a Colombian telecom company. The prices are loosely based on actual feature prices, after being converted back to US dollars from Colombian pesos.
The data set has 8 variables, described below:
bundleID: ID of each row of the data set.
gigabytes: Number of gigabytes of Internet data included in the plan. Options are 4 and 10.
minutes: Number of calling minutes included in the plan. Options are 100, 200, and 400.
music: Whether the plan includes free access to music ("yes" or "no" values).
social_networks: Whether the plan includes unlimited access to social networks ("yes" or "no" values).
price: Price of the plan in Colombia, expressed in USD.
rank: Rank from 1 to 24 of the preferred prepaid plans, as determined by the survey answers from nearly 1000 respondents.
fliprank: The "flipped" rank for each bundle, presented so that favored bundles take higher values, rather than lower ones. This will make the linear regression coefficients easier to interpret.
As the respondents evaluate the bundles, they do so with costs in mind – otherwise, we could expect respondents to simply choose the "best", or most feature-laden, options. The incremental feature costs are shown here:

The cost of any particular bundle can be found by adding up the costs of its features from the table above. For instance, a 10 Gigabyte plan with 400 calling minutes, no music, and access to social networks would cost $2.50 + $3.35 + $0.00 + $1.55, or $7.40. A 4 Gigabyte plan with 200 calling minutes and access to music, but not to social networks, would cost $1.50 + $2.95 + $1.60 + $0.00, or $6.05.
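As a quick check on the arithmetic above, here is a minimal sketch that sums feature costs for a bundle. The dictionary and function names are hypothetical, and the cost table contains only the values quoted in the two examples; the cost of the 100-minute option is not stated in the text, so it is deliberately left out. Costs are stored in cents to avoid floating-point rounding surprises.

```python
# Incremental feature costs in US cents, limited to the values quoted in the text.
# The 100-minute cost is not given in the text, so it is intentionally omitted.
FEATURE_COST_CENTS = {
    "gigabytes": {4: 150, 10: 250},
    "minutes": {200: 295, 400: 335},
    "music": {"no": 0, "yes": 160},
    "social_networks": {"no": 0, "yes": 155},
}


def bundle_cost(gigabytes, minutes, music, social_networks):
    """Return the bundle price in USD by summing its feature costs."""
    cents = (
        FEATURE_COST_CENTS["gigabytes"][gigabytes]
        + FEATURE_COST_CENTS["minutes"][minutes]
        + FEATURE_COST_CENTS["music"][music]
        + FEATURE_COST_CENTS["social_networks"][social_networks]
    )
    return cents / 100


print(bundle_cost(10, 400, "no", "yes"))  # 7.4
print(bundle_cost(4, 200, "yes", "no"))   # 6.05
```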
To run the linear regression and read the results more easily, we modified the rank variable, giving the #1 bundle a fliprank value of 24, the #2 bundle a fliprank value of 23, the #3 bundle a fliprank value of 22, and so on (that is, fliprank = 25 - rank).
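The flip itself is a single subtraction. A minimal pandas sketch, assuming a column named rank that holds the values 1 through 24 (the frame below contains only that column and no survey data):

```python
import pandas as pd

N_BUNDLES = 24

# Stand-in frame holding only the rank column, 1 (best) through 24 (worst).
ranks = pd.DataFrame({"rank": range(1, N_BUNDLES + 1)})

# Flip the scale so the most preferred bundle gets the highest value.
ranks["fliprank"] = (N_BUNDLES + 1) - ranks["rank"]

print(ranks.head(3))
```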
Why use a Complete Ranking System?
In a previous article, we wrote about a ratings-based Conjoint analysis system, in which survey respondents considered features on an amusement park ride, rating various feature combinations from 1 to 10.
While such a system lends itself well to linear modeling, it is subject to a notable flaw – not all respondents use the same rating "baseline" when asked to assign values on a 1–10 scale. One movie reviewer, for instance, might average a 7.2 out of 10, while another reviewer who sees the same group of movies gives an average closer to 5.0.
With a complete ranking system, the baselining problem is solved, as all respondents are simply asked to determine the relative attractiveness of each option. One possible drawback with such a system, though, is the risk of survey fatigue: if there are too many bundles to rank, respondents might become overwhelmed. As more features and more levels are added, the effect on the total number of bundles is multiplicative. Here, with just 2 data options, 3 calling minute options, 2 music options, and 2 social network options, we are already at 24 total combinations (2 x 3 x 2 x 2).
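The multiplicative growth is easy to see by enumerating the full set of combinations. A short sketch using itertools.product from the standard library (the levels dictionary is just an illustrative container for the feature options listed earlier):

```python
from itertools import product

# Feature levels as described in the variable list.
levels = {
    "gigabytes": [4, 10],
    "minutes": [100, 200, 400],
    "music": ["no", "yes"],
    "social_networks": ["no", "yes"],
}

# Every bundle is one choice per feature, so the count is 2 * 3 * 2 * 2.
bundles = [dict(zip(levels, combo)) for combo in product(*levels.values())]

print(len(bundles))  # 24
print(bundles[0])
```

Adding one more two-level feature to the dictionary would double the count to 48, which is where survey fatigue starts to become a practical concern.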
Linear Modeling for Full Rank Scenario
As noted above, to improve the suitability for linear regression modeling, we "flipped" the rankings so that higher values would be associated with more favored bundles. This makes the interpretation of the linear regression coefficients easier – positive values are now associated with better options, while negative values are associated with worse ones.
Below you can see the clean dataset as it was read into the environment:

Prior to running such a model, all of the input features should be dummified, including the numeric values such as the data plan and the number of calling minutes per month. This way, the model results will display each feature option as a discrete choice. Doing so means that we can attach a precise coefficient value to each of those options, rather than using a single numeric coefficient to try to express preferences across an entire continuous range of possible options.
The get_dummies() function from pandas helped us prepare the variables for the linear modeling:
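A minimal sketch of this step, rebuilt from the feature levels described above rather than from the original file. Using drop_first=True (so that the 4 gigabyte, 100 minute, "no" music, and "no" social network levels act as baselines) is an assumption here, as is dtype=int for 0/1 output.

```python
from itertools import product

import pandas as pd

features = ["gigabytes", "minutes", "music", "social_networks"]

# Rebuild the 24 bundles from the feature levels (no survey results involved).
bundles = pd.DataFrame(
    list(product([4, 10], [100, 200, 400], ["no", "yes"], ["no", "yes"])),
    columns=features,
)

# Passing `columns` forces the numeric features to be treated as categories too.
# drop_first=True is an assumption: it drops one baseline level per feature.
dummies = pd.get_dummies(bundles, columns=features, drop_first=True, dtype=int)

print(list(dummies.columns))
```

With these settings, each remaining column is a 0/1 indicator for one non-baseline level, such as minutes_400 or music_yes.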

Then the linear model is built, with fliprank as the outcome variable and with the dummified levels for gigabytes, minutes, music, and social_networks as inputs, using the LinearRegression class from scikit-learn:
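A hedged sketch of how such a fit might look. The rank column below is a PLACEHOLDER (a seeded random ordering), not the survey's ranking, so the printed coefficients carry no meaning; only the mechanics are illustrated. The drop_first=True baseline choice is also an assumption.

```python
from itertools import product

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

features = ["gigabytes", "minutes", "music", "social_networks"]
bundles = pd.DataFrame(
    list(product([4, 10], [100, 200, 400], ["no", "yes"], ["no", "yes"])),
    columns=features,
)

# PLACEHOLDER outcome: a seeded random ordering, NOT the real survey ranking.
rng = np.random.default_rng(0)
bundles["rank"] = rng.permutation(24) + 1
bundles["fliprank"] = 25 - bundles["rank"]

# Dummified inputs (one baseline level dropped per feature) and flipped-rank outcome.
X = pd.get_dummies(bundles[features], columns=features, drop_first=True, dtype=int)
y = bundles["fliprank"]

model = LinearRegression().fit(X, y)

# One coefficient per non-baseline level, measured relative to its baseline.
coefs = pd.Series(model.coef_, index=X.columns)
print(coefs.round(2))
print(round(model.intercept_, 2))
```

Each coefficient is read as the expected change in fliprank when moving from the baseline level to that level, holding the other features fixed.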

As for the suitability of using ordinary least squares linear regression, there are a couple of concerns worth noting.
First, the outcome values in this dataset are uniformly distributed, and bounded by a range of 1 to 24. This range constraint could present a problem if we were using this model for predictive purposes (how would we interpret a value less than 1, or greater than 24?), but it does not prevent us from using the model for an explanatory purpose. Our goal here is simply to better understand the relationship between feature options and customer preference, and we can accomplish that here.
Second, there could be increased risk of heteroskedasticity, given the non-normal distribution of the response variable. We used regression diagnostic plots to check for this, and found no evidence of heteroskedasticity in the model results.
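A residuals-versus-fitted view is one common diagnostic for this. The sketch below is not the authors' diagnostic code: the helper name is hypothetical, the data are a toy example, and the correlation between fitted values and absolute residuals is only a crude numeric stand-in for eyeballing a plot, not a formal test.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def residual_diagnostics(X, y):
    """Fit OLS and return fitted values, residuals, and a crude spread measure."""
    model = LinearRegression().fit(X, y)
    fitted = model.predict(X)
    residuals = np.asarray(y) - fitted
    # Correlation between fitted values and |residuals|; values near 0 suggest
    # the residual spread does not grow or shrink with the fitted value.
    spread_corr = np.corrcoef(fitted, np.abs(residuals))[0, 1]
    return fitted, residuals, spread_corr


# Toy data for illustration only (not the survey data).
X_toy = np.arange(1, 21).reshape(-1, 1)
y_toy = 2.0 * X_toy.ravel() + np.tile([1.0, -1.0], 10)

fitted, residuals, spread_corr = residual_diagnostics(X_toy, y_toy)
print(round(float(spread_corr), 3))
```

The returned fitted and residuals arrays can be passed to any plotting library to draw the usual scatter plot; a funnel shape in that plot would point towards heteroskedasticity.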

Interpreting the Results
The model coefficients reveal some valuable patterns:

They show us, for instance, that survey respondents are very strongly in favor of having the social network access and the 10 gigabyte data plan.
The negative values for the larger calling minute numbers, and for the music option, do not necessarily mean that customers do not want these options at all. Instead, it may indicate that consumers do not wish to pay the incremental costs associated with these features. The telecom carrier may wish to modify these options in future surveys – perhaps if the associated costs were reduced, respondents would be more favorably disposed towards these options.
Alternatively, the lack of interest in these features could indicate other things about consumer tastes. Perhaps the consumers in this group are accustomed to listening to music through other formats, and simply do not plan on using more than 100 calling minutes per month. Through continual iteration of the feature and pricing options, along with analysis of actual subscriber data, the telecom company could continually work to find the ideal set of choices for its consumer base.
