Archive for Data Mining

How to Determine What Data to Combine

There is a lot of value in combining data from one business area with data from another business area. Similar to a jigsaw puzzle, when we combine data sets and put the pieces together, we get a complete picture of customers, events, and activities. But how do you know what data to combine?

Take Inventory of What You Have

To get started, take inventory of the data you already have available in your business area. Let’s take members for example. Membership teams often require a high level of granularity. They also have years of membership data that can be leveraged. The data they have may be stored in their Association Management System, Customer Relationship Management system, and their Financial Management System.
After identifying the data sources, consider what data is stored in each data source. Identify the file type and how you extract or integrate the data with other systems.

Consider What’s Missing

To determine what data could augment your existing data source, think about the aspects of the customer or activity that you care about.
What information could help answer your business question? If you don’t have a business question, what information would provide additional insight on customer behavior?
The membership team typically has data on when a member joined, membership type, length of membership, contact information, and dues payments. What other information would help them understand members? It may be helpful to combine membership data with components from other areas such as the number of events attended in the past two years, the last meeting attended, age, member status, tenure in the industry, and total spending in the past year.

Combine and Analyze

Combine the data and analyze it. Look for trends and relationships. Distill the information so that each component of activity that is of interest is presented as an attribute of that person.
The table below shows some combined information as it relates to the Top 10 and Bottom 10 thread topics from an association’s online community. Using the information, we can see that a correlation may exist between a person’s attributes and the most active threads. From the data below, it looks like younger individuals with shorter membership tenure and less professional development are replying and posting to threads started by younger authors more than to the bottom threads. Perhaps action can be taken to target younger members with messaging that encourages them and explains the benefits of authoring and responding to community posts.
Once you combine data, you can determine if there is actually a relationship between two data sets. You can also see if you need additional data to augment your analysis. Using business intelligence tools, like Tableau, allows you to easily connect data sets and experiment.
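As a minimal sketch of combining two data sets, the following joins hypothetical membership records with event-attendance counts using pandas (the column names and sample values are illustrative, not from any particular system):

```python
import pandas as pd

# Hypothetical membership and event-attendance extracts
members = pd.DataFrame({
    "member_id": [1, 2, 3],
    "join_year": [2010, 2015, 2018],
    "membership_type": ["Full", "Student", "Full"],
})
events = pd.DataFrame({
    "member_id": [1, 1, 3],
    "event": ["Annual Meeting", "Webinar", "Annual Meeting"],
})

# Count events attended per member, then join onto the member records
event_counts = events.groupby("member_id").size().rename("events_attended")
combined = members.merge(event_counts, on="member_id", how="left")
combined = combined.fillna({"events_attended": 0})
print(combined)
```

A left join keeps every member, including those with no event activity, so the combined table shows engagement (or the lack of it) for the whole membership.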

How Your Association Can Implement Propensity Modeling

Last week, we introduced you to Propensity Modeling and how it can help your association make data-guided decisions while providing great value to your customers. We’ll now dig into some of the technical detail and steps to implement Propensity Modeling.

Step 1. Prepare Your Data

Consistent, complete, and accurate data is the foundation of predictive modeling. Your data should ultimately look like a very wide row with a dependent variable of 1 or 0 relating to the business action taken (or not) along with a variety of independent variables with values at the time of transaction.
Categorical data should be converted to “dummy” variables, where values are transformed into individual columns, as opposed to the row-based data that is ideal for data exploration. Fortunately, the ability to quickly access high-quality, timely data regardless of source from an environment such as a dimensional data model makes the process much easier.
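As a sketch of the dummy-variable conversion described above, using hypothetical membership columns:

```python
import pandas as pd

# Hypothetical modeling rows: one wide row per member, with a 0/1 outcome
data = pd.DataFrame({
    "renewed": [1, 0, 1],                        # dependent variable
    "tenure_years": [5, 1, 12],                  # numeric independent variable
    "member_type": ["Full", "Student", "Full"],  # categorical variable
})

# Transform the categorical column into individual 0/1 "dummy" columns
model_data = pd.get_dummies(data, columns=["member_type"])
print(model_data.columns.tolist())
```

Each category value becomes its own column, producing the wide, numeric row layout that modeling algorithms expect.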

Step 2. Select Your Variables

Incorporating the right mix of features is vital to the success of any predictive model. While it’s great to have many variables available as candidates, having too many can actually harm model accuracy.
Several automated stepwise techniques are available to propose variables by iterating through different combinations while considering measures such as significance and model error. Relying solely on automated processes is not recommended: statistics should be tempered with business expertise to weed out variables that are not meaningful and to pick between highly correlated variables. Another challenge is the potential for overfitting, meaning the variables selected based on the sample data are not best for unseen data.

Step 3. Select Your Modeling Technique

Next, you will want to select a modeling technique. You will likely be deciding between a linear regression model and a logistic regression model.
Linear Regression models have outcomes based on nearly infinite continuous variables, such as time, money, or large counts. Propensity Modeling generally leverages Logistic Regression models to derive probability-based scores between a fixed range of 0 and 1. The underlying algorithms used to create models are very different as well.
Logistic Regression is often perceived as an approach to estimate binary outcomes by rounding to 0 or 1, but a score of .51 is very different from a score of .99. A common approach is to assign records to categories using deciles: ten groups of roughly equal size, ranked by score.
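A sketch of the scoring-and-deciling workflow, using scikit-learn and pandas on synthetic data standing in for real membership records:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for membership data with a 0/1 outcome
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X, y)
scores = model.predict_proba(X)[:, 1]  # propensity scores between 0 and 1

# Rank scores into deciles: ten roughly equal-sized groups, 1 = lowest scores
deciles = pd.qcut(scores, 10, labels=range(1, 11))
print(pd.Series(deciles).value_counts().sort_index())
```

Note that `pd.qcut` bins by rank rather than by fixed value ranges, so each decile holds about the same number of records.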

Step 4. Determine If You Need to Use Any Other Analytic Techniques

You can use several other advanced analytic techniques to accomplish goals similar to Propensity Modeling.

  • Clustering is a form of unsupervised learning as the model is not based on a specific outcome or dependent variable, but simply groups records such as individuals.  The groups can result in customer segments that are ideal for certain products or marketing approaches.
  • Collaborative Filtering is based solely on the actions of groups of users as opposed to individual characteristics.  This is a common approach for recommendation systems based on actions such as purchases, product ratings, or web activity.
  • Decision Trees traverse a path of variables with branch “splits” based on the contributions of variables to ultimate outcomes.  This technique can be effective when a very small set of variables lead to outcomes influenced by downstream groups of variables.
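To make the first technique concrete, here is a minimal clustering sketch with scikit-learn; the member attributes and values are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical member attributes: [events attended, years of tenure]
members = np.array([[1, 1], [2, 1], [1, 2],      # low-engagement records
                    [10, 8], [11, 9], [9, 10]])  # high-engagement records

# Unsupervised: no outcome variable, the model simply groups similar records
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(members)
print(kmeans.labels_)
```

The resulting labels split the records into two groups that could serve as customer segments for targeted products or marketing.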

You can also combine models, where the results of one model are the input to another, to create an ensemble model.
The decile scores generally represent a range from “sure thing” to “lost cause”. You can use the different decile groups to guide approaches such as the effort to retain individuals, pricing strategies, and marketing messages.

Step 5. Determine Measurement Approach

The Lift of a Propensity Model represents the ratio of the response rate from applying the model to the response rate from “random” individuals. An ideal way to derive this measure is to maintain a control group for comparison with a similar group targeted using the Propensity Model. It can be a difficult decision to risk potential revenue, so a common approach is simply to compare before-and-after results.
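The lift calculation itself is simple arithmetic; the response counts below are hypothetical:

```python
# Lift: response rate of the model-targeted group divided by the
# response rate of a random (control) group. Numbers are hypothetical.
model_group_responses, model_group_size = 120, 1000
control_responses, control_size = 40, 1000

model_rate = model_group_responses / model_group_size  # 0.12
control_rate = control_responses / control_size        # 0.04
lift = model_rate / control_rate
print(f"Lift: {lift:.1f}x")
```

A lift of 3, for example, means targeting with the model produced three times the response rate of random selection.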

Step 6. Consider How You Will Take Action

Before using any analytics model, it’s a good idea to consider how you can take action on the information. What decisions will you make as a result of the information? Similarly, how will you measure the results of the action and use it to inform your model?
For example, you can use a propensity model to reduce expenses. Targeting individuals differently based on their propensity to take action can optimize costs in different ways. Costs might be direct costs, such as actual print mailings or list rentals, or costs can be indirect, such as many non-personalized emails that contribute to information overload. You will want to establish a baseline and a goal for cost reduction to measure success of the model.

Step 7. Identify Your Tool

A range of different options are available to implement Propensity Modeling.

  • R Programming: A popular open-source statistical programming language with many mature packages to perform the techniques underlying Propensity Modeling.
  • Alteryx Software: A software platform offering pre-built tools for different modeling techniques and business scenarios.
  • Amazon Machine Learning: A cloud-based service, part of the comprehensive Amazon Web Services environment, that provides visual wizards for performing Propensity Modeling.

This may seem like a lot of steps, but once you have all of your comprehensive data easily accessible along with an available user-friendly tool, all you will need is your imagination to better understand your association’s customer journeys to make valuable data-guided decisions.

How to Harness the Power of Recommendation

Taking a customer-focused approach to data analytics helps provide optimal value, enhance engagement and understand the overall customer journey. Individuals’ actions provide valuable information that goes further than what is collected with surveys and online profiles. Additionally, actions uncover hidden patterns that can be used to build a recommendation system to guide customers toward other interests.
Here are the most common approaches to creating recommendation systems:

  • Collaborative filtering. This is based on data about similar users or similar items. It includes these techniques:
    • Item-based: Recommends items that are most similar to the user’s activity
    • User-based: Recommends items that are liked by similar users
  • Content-based filtering: Makes suggestions based on user profiles and similar item characteristics
  • Hybrid filtering: Combines different techniques

Recommendation system results are similar to those on sites that suggest products and people, like Amazon and LinkedIn. Collaborative filtering leads to more of a self-learning process, since it is based entirely on actual activity rather than data provided by users. There are scenarios where the other approaches are more appropriate, which we’ll address soon.
Similarity between users or items is measured by “distance” calculations from those long-ago geometry and trigonometry classes. You can use the results with a visualization tool such as Tableau, creating a similarity matrix and quickly identifying relationships.
It is sometimes helpful to group individuals and items into categories, which can be done by combining similarity scoring with data mining techniques like cluster analysis and decision trees.
Recommendation systems generally require data structured by columns instead of the row-based data that is best for interactive data discovery. Similar to text analytics, the items themselves — meetings, publications, donations, and content — represent large columns. Specialized R packages expect this structure for the recommendation system features described here.
These algorithms generally need binary values, like a “yes” if someone purchased an item and a “no” if they did not. But if users can rate items on a scale of 1 to 5, what does a score of 3 mean? Normalizing scores based on individual and overall ratings is a good way to answer this question.
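One common normalization is mean-centering each user's ratings, sketched here with two hypothetical raters:

```python
import numpy as np

# Hypothetical 1-5 ratings from two users with different rating habits
generous = np.array([5, 4, 3], dtype=float)  # rates high; 3 is a low score
critical = np.array([3, 2, 1], dtype=float)  # rates low; 3 is a high score

# Mean-center each user's ratings so scores reflect relative preference
generous_norm = generous - generous.mean()
critical_norm = critical - critical.mean()
print(generous_norm, critical_norm)
```

After centering, the generous user's "3" becomes negative and the critical user's "3" becomes positive, so the same raw score correctly means different things for different raters.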
The data requirements are really not as onerous as they may sound. Once data is in the right format for the R analysis tools, your imagination can take over to drive actionable association analytics. Content-based filtering works well for new users, and a hybrid approach can help prevent a “filter bubble” where some people get a too-narrow set of interests from similar recommendations.
Data from meeting registrations, membership history, donations, publication purchases, content interaction, web navigation, survey responses and profile characteristics can be used to guide association customers. Additionally, recommendations can bring people with common interests together. This new insight can be used to enhance all customer interactions, ranging from email marketing to dynamic website presentation to event sessions.

Mind Your Ps and Qs, But Be Sure to Consider R

The analytics cycle described by Gartner depicts moving beyond descriptive analytics towards analytics that answer questions of “why did it happen?”, “what will happen?”, and “how can we make it happen?” Value and difficulty both increase throughout this natural progression.
R is easier and faster to use than traditional programming languages like C++ and Java. R is open source and was created in 1993 specifically for statistical computing. One reason it is efficient and easy to use is its well-documented functions, which don’t require complicated programming constructs such as iterative loops. Analysts can access and test functions one at a time on the fly using tools such as RStudio, and can also develop a series of functions to save and execute as a statistical application. In other words, think of a SQL query tool, which allows you to write SQL and test your results interactively, versus a SQL stored procedure, which may be called once to execute a series of statements.
In technospeak, R leverages data structures based on vectors and matrices which are key to the underlying mathematics used for data mining, optimization, and machine learning.
Why is this important for associations? R was recently ranked as the 6th most popular programming language in an annual IEEE study, and was also the biggest mover among the top 10 languages. Knowledge of R is rapidly increasing because it is heavily used in a variety of academic programs, such as the social sciences. As new talent enters the association workforce, applicants with this skill set will be in high demand.
Popular analytics tools commonly used by associations are integrating R into their core products. Tableau provides easy access to analytics using R functions with calculated fields. Microsoft recently acquired Revolution Analytics, a leading commercial provider of software and services based on R, and announced plans to integrate R into the core SQL Server product. These and other analytics applications can already leverage the power of R through integration with extract, transform, and load (ETL) scripting to incorporate timely advanced analytics into a comprehensive data layer.

Capabilities of R can help answer common association business questions:

  • What other products tend to be purchased with memberships?
    Association rules, commonly known as Market Basket Analysis, provide measures of such product affinities using easily understood statistics.
  • What characteristics are related to overall membership revenue and how much will individual members likely spend?
    A variety of regression models estimate both yes/no outcomes and dollar amounts along with the most impactful characteristics.
  • How can we categorize members and prospects by level of engagement?
    Classification algorithms such as decision trees identify unique characteristics that define different segments of individuals and assign them into groups such as low, medium, and high.
  • Did that website change or marketing campaign have an impact?
    “A/B testing” backed by hypothesis testing indicates whether behavioral changes resulting from efforts such as website changes or marketing campaigns exposed to different groups are statistically significant.
  • What do groups of our members have in common?
    Undirected data mining such as clustering identifies characteristics of similar member groups that potentially contribute to different individual behavior.
  • What are common and related words in blog posts, social interaction, and publishing content?
    Text mining identifies words, phrases, and concepts that describe content along with those likely to be found together to help understand interests to guide association product offerings.
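The first question above, Market Basket Analysis, rests on two easily computed statistics: support (how often items appear together) and confidence (how often buyers of one item also buy another). A minimal sketch with hypothetical purchase baskets:

```python
# Hypothetical purchase baskets per member
baskets = [
    {"membership", "annual_meeting"},
    {"membership", "annual_meeting", "publication"},
    {"membership", "publication"},
    {"annual_meeting"},
]

# Support: share of baskets containing an itemset
def support(itemset):
    return sum(itemset <= basket for basket in baskets) / len(baskets)

# Confidence of the rule "a -> b": support of both divided by support of a
def confidence(a, b):
    return support({a, b}) / support({a})

print(confidence("membership", "annual_meeting"))
```

Here two thirds of membership buyers also registered for the annual meeting, which is exactly the kind of easily understood statistic association rules produce.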

Predictive analytics using R can help you understand the story in your data and help association staff make decisions with confidence!

Introduction to Data Mining for Associations

The TV show Parks and Recreation recently had an episode where a huge tech company took data mining to the extreme, knowing what would be the perfect present for every person in the town. In the episode, it caused a huge stir due to privacy concerns. And I am sure most of you remember the incident where Target predicted a girl’s pregnancy before her family even knew. Although data mining has been getting a bad rap lately, when done in a way that respects the privacy of your customers, it can provide valuable information for your association.
What is it?
Data Mining is just using computational power to comb through large amounts of data in order to find patterns and uncover meaningful information.
What does it involve?

  1. Classification – This takes your data and determines which bucket it belongs in. Suppose you ran a model on your data to determine who is likely to renew and who is likely to drop their membership. Classification would then take this model, break out your members, and put them in buckets depending on how likely they are to renew.
  2. Factor Analysis – A useful tool for investigating relationships within complex areas such as socioeconomic status, dietary patterns, or psychological scales. It allows researchers to investigate concepts that are not easily measured directly.
  3. Regression analysis – is used to understand which among the independent variables (e.g. job sector, meetings attended, etc.) are related to the dependent variable (e.g. member retention), and to explore the forms of these relationships. Regression analysis is also widely used for predicting and forecasting.
  4. Segmentation – Most of us do segmentation all the time. You can segment organizations by number of employees, or group individuals who have similar job titles.
  5. Association – If you have been to Amazon you have seen how association works. Let’s say I buy a NutriBullet and a FitBit, an algorithm runs to see other people who have bought those items and what else they buy. Then Amazon will recommend those products to me.
  6. Sequence Analysis – This type of analysis looks for patterns of values that happen in sequence. For example, if you were looking how your customers interact with your website, you can see if accessing web pages in a certain sequence means they are more likely to purchase products.
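To make the first of these concrete, here is a classification sketch using a decision tree from scikit-learn; the engagement attributes and renewal labels are hypothetical:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: [meetings attended, years as member] -> renewed?
X = [[0, 1], [1, 1], [0, 2],   # low-engagement members who did not renew
     [5, 8], [6, 9], [4, 10]]  # high-engagement members who renewed
y = [0, 0, 0, 1, 1, 1]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Assign a new member to a "likely to renew" (1) or "at risk" (0) bucket
print(tree.predict([[5, 7]]))
```

The fitted tree learns simple splits on the input variables and then puts each new member into one of the buckets, exactly as described above.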

How do I do it?
Now that you understand all the different types of analysis that can be performed while data mining, the next step is to understand how you can get started.

  1. Define the problem you want to solve – As smart as most computers are, they won’t be able to answer a question when none is asked. First determine what question you want to ask, for example, “Who is most at risk of not renewing membership?” or “What is the profile of our optimal customer?”
  2. Determine what analysis is appropriate for your question – Now that you have defined what you need to ask, you need to determine which of the 6 analysis types above will best help you answer your question.
  3. Prepare your data – Now you will need to find and clean your data. You can only analyze data you have access to and the answers you will get from analysis will only be good if your data is clean.
  4. Training – Just like a person, your model will need training as well. Think of this like giving a new employee a tour when they first start. They meet all the staff and get to find out how your association works and what everyone does.
  5. Testing – Once you run an analysis on your data and a model is created, you then make sure that the model is accurate. The best way to do this is to take historical data and run it through the model to see if it accurately predicts what has already happened.
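The training and testing steps above are commonly implemented by holding out part of the historical data; a sketch with scikit-learn on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for historical membership data
X, y = make_classification(n_samples=400, n_features=6, random_state=0)

# Hold out a quarter of the historical data to test the trained model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # training

# Testing: accuracy on records the model has never seen
print(f"Held-out accuracy: {model.score(X_test, y_test):.2f}")
```

Because the test records were not used in training, the held-out accuracy approximates how well the model will predict outcomes it has not already seen.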

Data mining is an advanced part of Business Intelligence and should be an end goal for any association analytics initiative. It allows you to take your most valuable asset, data, and use it to help you not only in the present but also help you plan for your association’s future.