With AI no longer science fiction, associations are using advanced technologies to convert mountains of data into actionable insights.
At the recent EMERGENT event, hosted by Association Trends, we had the opportunity to jointly present case studies with ASAE’s Senior Director of Business Analytics, Christin Berry.
These success stories include how ASAE has:
- Combined artificial intelligence and text analytics to enhance customer engagement, understand evolving trends, and improve product offerings
- Doubled online engagement, as measured by unique open and click-to-open rates, by using AI to personalize newsletters
- Reduced the need for surveys, identified what’s trending, and measured through Community Exploration
- Leveraged Expertise Search and Matching to better identify experts and bring people with similar interests together
I’m Matt Lesnak, VP of Product Development & Technology at Association Analytics, and I hope to demystify these emerging technologies to jumpstart your endeavors in association innovation.
Text and Other Analytics
Associations turn to analytics and visual discovery for answers to common questions including:
- How many members do we have for each member type?
- How many weeks in advance of the annual meeting are people registering?
- How much are sales this year for the top products?
Questions about text content can be very different, and less specific. For example:
- What is it about?
- What are the key terms?
- How can I categorize the content?
- Who and where is it about?
- How is it like other content?
- How is the writer feeling?
It is widely estimated that 70% of analytics effort is spent on data wrangling.
This high proportion is no different for text analytics but can be well worth the effort. Text analytics involves unique challenges including:
- Term ambiguity: Bank of a river vs. a bank with money vs. an airplane movement
- Equivalent terms: Eat vs. ate, run vs. running
- High volume: Rapidly growing social data
- Different structure: Doesn’t really have rows, columns, and measures
- Significant data wrangling: Must be transformed into usable format
Like the ever-growing data from association source systems that might flow to a data warehouse, text content of interest might include community discussions, articles or other publications/books, session/speaker proposals, journal submissions, and voice calls or messages.
Possible uses include enhancing your content strategy, providing customized resources, extracting trending topics for CEOs, and identifying region-specific challenges.
ASAE is working with rasa.io to automatically identify topics of newsletter content as part of a pilot that significantly improved user engagement. ASAE and rasa.io first tracked newsletter interactions over time to understand individual preferences and trending topics. Individuals then received personalized newsletters based on demonstrated preferences.
The effort has been very successful: unique open and click-to-open rates have more than doubled for the personalized newsletters.
Underlying technology includes Google, IBM Watson, and Amazon Web Services, combined with other machine learning tools developed by rasa.io.
ASAE leverages a near-real-time integration with over 10 million community data points, combined with an enterprise data warehouse, to analyze over 50,000 pieces of discussion content and over 50,000 site searches. The integration is offered as part of the Association Analytics Acumen product through a partnership with Higher Logic.
Information extracted includes named entities, key phrases, term relevancy, and sentiment analysis. This capability provides several impactful benefits.
- Visualize search terms
- What’s trending
- Staff and volunteer use
- Reduce need for surveys
- Aboutness of posts as content strategy
- Identifying key expertise areas
- Connecting like-minded individuals
Underlying technology includes AWS Comprehend, Python, and Hadoop with Mahout.
Expertise Search and Matching
Another application of text analytics that we’ve implemented involves enabling associations to better identify experts and bring together people with similar interests. In addition to structured data from multiple sources, text from content including meeting abstracts and paper manuscripts provides insights into potential individual interests and expertise.
This incorporates data extracted from content using approaches including content similarity, term relevancy, validation of selected tags, and identifying potential collaborators.
Underlying technology includes Python and Hadoop with Mahout.
Approaches and Technology
We’ve written extensively about the importance of transforming data into a format optimized for analytics, such as a dimensional data model implemented as a data warehouse.
Thinking back to the common association questions involving membership, event registration, and product sales, these are based on discrete data such as member type, event, and day.
Text data is structured for analysis using a different but fundamentally similar approach: each term becomes a field, much as member type is a field in a table.
Picture a matrix with each document as a row and each term as a column.
This is referred to as “vector space representation”. With thousands of commonly used words in the English language, that can be a big matrix. Fortunately, we have ways to reduce this size and complexity.
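As a rough illustration of that matrix, here is a minimal Python sketch using only the standard library. The sample documents and the function name `build_document_term_matrix` are invented for this example and are not taken from any ASAE system.

```python
from collections import Counter

# Hypothetical sample documents, invented for illustration.
documents = [
    "members renew membership online",
    "annual meeting registration opens online",
    "members register for the annual meeting",
]

def build_document_term_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    tokenized = [doc.lower().split() for doc in docs]
    # One column per distinct term, in alphabetical order.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

vocabulary, matrix = build_document_term_matrix(documents)
print(vocabulary)
for row in matrix:
    print(row)
```

Even with three short documents the matrix is mostly zeros, which hints at why reducing its size matters for real content.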
First, some basic text preparation:
- Tokenization – splitting into words and sentences
- Stop Word Removal – removing words such as “a”, “and”, “the”
- Stemming – reduction to root word
- Lemmatization – morphological analysis to reduce words
- Spelling Correction – like common spell-checkers
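The first few preparation steps can be sketched in plain Python. This is a deliberately crude illustration: the stop word list, the `LEMMAS` lookup, and the suffix-stripping `naive_stem` function are all assumptions made for this example, standing in for the much more complete resources that real text libraries provide. Spelling correction is left out.

```python
import re

# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "for", "is", "are"}

# Hypothetical lookup for irregular forms; a real lemmatizer uses a full dictionary.
LEMMAS = {"ate": "eat", "ran": "run", "was": "be"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    """Strip one common suffix. A crude stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def normalize(token):
    """Use the lemma lookup when available, otherwise fall back to stemming."""
    return LEMMAS.get(token, naive_stem(token))

def prepare(text):
    """Tokenize, remove stop words, then normalize each remaining token."""
    return [normalize(t) for t in remove_stop_words(tokenize(text))]

print(prepare("The members ate and are registering for the meetings"))
# ['member', 'eat', 'register', 'meeting']
```

Every step shrinks the number of distinct terms, and therefore the number of columns in the matrix.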
Another classic approach is known as “Term Frequency–Inverse Document Frequency (TF-IDF)”. We use TF-IDF to reduce the data to include the most important terms using the calculated scores. TF-IDF is different from many other techniques as it considers the entire population of potential content as opposed to isolated individual instances.
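To make the scoring concrete, here is a minimal sketch that assumes one common variant of the formula: term frequency is the term count divided by document length, and inverse document frequency is the natural log of the number of documents divided by the number of documents containing the term. Other variants exist, and the sample documents and function names are invented for this example.

```python
import math
from collections import Counter

# Hypothetical sample documents, invented for illustration.
documents = [
    "members renew membership online",
    "members register for the annual meeting online",
    "members submit journal articles online",
]

def tf_idf(docs):
    """Return one {term: score} dict per document.

    Uses tf = count / document length and idf = ln(N / document frequency),
    which is only one of several common TF-IDF variants.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

def top_terms(doc_scores, k=3):
    """Return the k highest-scoring terms for one document."""
    ranked = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
    return [term for term, _ in ranked[:k]]

scores = tf_idf(documents)
for doc, doc_scores in zip(documents, scores):
    print(doc, "->", top_terms(doc_scores))
```

In this sample, "members" and "online" appear in every document, so they score zero everywhere. That is the whole-population effect described above: a term only matters if it helps distinguish one document from the rest.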
Other key foundational processing:
- Part-of-Speech Tagging: Noun, verb, adjective
- Named Entity Recognition: Person, place, organization
- Structure Parsing: Sentence component relationships
- Synonym Assignment: Discrete list of synonyms
- Word Embedding: Words converted to numbers
The use of Word Embedding, also referred to as Word Vectors, is particularly interesting. For example, the word embedding similarity of “question” and “answer” is over 0.93. This isn’t necessarily intuitive, and it is not feasible to manually maintain rules for different term combinations.
A team of researchers at Google created a group of models known as Word2vec that is implemented in development languages including Python, Java, and C.
Here are common analysis techniques:
- Text Classification: Assignment to pre-defined groups, which generally requires a set of classified content
- Topic Modeling: Derives topics from text content
- Text Clustering: Separating content into similar groups
- Sentiment Analysis: Categorizing opinions with measures for positive, negative, and neutral
Finding and Measuring Results
With traditional data queries and interactive visualizations, we generally specify the data we want by selecting values, numeric ranges, or portions of strings. This is very binary – either the data matches the criteria, or it does not.
We filter and curate text using similarity measures that estimate “distance” between text content. Examples include point-based Euclidean Distance, vector-based Cosine Distance, and set-based Jaccard Similarity.
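The three measures can be sketched in a few lines of Python. The vectors and token sets below are hypothetical: they assume two short documents represented as term counts over a four-word vocabulary, and the same two documents represented as sets of terms.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """One minus the cosine similarity, so identical directions give 0."""
    return 1.0 - cosine_similarity(a, b)

def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)

# Hypothetical term counts over the vocabulary ["annual", "meeting", "members", "online"].
doc_a = [1, 1, 0, 1]
doc_b = [1, 1, 1, 0]

# The same two hypothetical documents as sets of terms.
terms_a = {"annual", "meeting", "online"}
terms_b = {"annual", "meeting", "members"}

print(euclidean_distance(doc_a, doc_b))   # square root of 2
print(cosine_distance(doc_a, doc_b))      # 1 - 2/3
print(jaccard_similarity(terms_a, terms_b))  # 2 shared terms out of 4 distinct
```

Unlike a binary filter, each measure returns a degree of closeness, so content can be ranked or cut off at a threshold that suits the use case.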
Once we identify desired content, how do we measure overall results? This is referred to as relevance and is made up of measures known as precision and recall. Precision is the fraction of relevant instances among the retrieved instances, and recall is the fraction of all relevant instances that have been retrieved. The balance between these measures is a tradeoff between ensuring all content is included and only including content of interest. This should be driven by the business scenario.
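A small worked example may help. The post identifiers below are hypothetical: assume a search returned four community posts, and that three posts in total were actually relevant.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items and a set of relevant items."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical identifiers, invented for illustration.
retrieved = {"post-1", "post-2", "post-3", "post-4"}
relevant = {"post-2", "post-3", "post-5"}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
# precision=0.50 recall=0.67
```

Two of the four retrieved posts were relevant, and two of the three relevant posts were found. Returning more posts would tend to raise recall at the cost of precision, which is the tradeoff described above.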
This overall approach to text analytics is like that used for recommendation engines based on collaborative filtering driven by preferences of “similar” users and “similar” products.
APIs to the Rescue
Fortunately, there are web-based Application Programming Interfaces (APIs) that we’ve used to help you get started. Here are online instances from Amazon and IBM for interactive experimenting:
- Amazon Web Services (AWS): https://console.aws.amazon.com/comprehend/v2/home?region=us-east-1#home
- IBM Watson: https://watson-api-explorer.ng.bluemix.net/
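For those who prefer code to a console, here is a minimal sketch of calling Amazon Comprehend from Python through the boto3 library. It assumes boto3 is installed and AWS credentials are already configured. The helper names `analyze_text` and `make_comprehend_client` are invented for this example, and the region simply mirrors the console link above.

```python
def analyze_text(client, text, language_code="en"):
    """Collect sentiment, key phrases, and named entities for one piece of text."""
    sentiment = client.detect_sentiment(Text=text, LanguageCode=language_code)
    phrases = client.detect_key_phrases(Text=text, LanguageCode=language_code)
    entities = client.detect_entities(Text=text, LanguageCode=language_code)
    return {
        "sentiment": sentiment["Sentiment"],
        "key_phrases": [p["Text"] for p in phrases["KeyPhrases"]],
        "entities": [(e["Text"], e["Type"]) for e in entities["Entities"]],
    }

def make_comprehend_client(region_name="us-east-1"):
    """Create a Comprehend client; requires boto3 and configured AWS credentials."""
    import boto3  # imported lazily so analyze_text can be tried with a stand-in client
    return boto3.client("comprehend", region_name=region_name)
```

A call might look like `analyze_text(make_comprehend_client(), "The annual meeting in Chicago was excellent.")`. Passing the client in as a parameter is a design choice for this sketch: it lets the function be exercised with a stand-in object before any AWS charges are incurred.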
This is a lot of information, but the takeaways are that there are big opportunities for associations to mine their trove of text data, and that it is easy to get started using web-based APIs to rapidly provide valuable insights.
Matt Lesnak, VP of Product Development & Technology