Introduction
The modern computing era has created widespread challenges for humanity, both within technology industries and for global culture at large. Enormous quantities of text data are being generated by social media platforms, surveys and public research firms, public relations, advertising, and retail. At Within3, we manage insight and communication management platforms that facilitate conversations between pharmaceutical companies and the healthcare providers that prescribe their drugs. Managing and understanding high volumes of free-text feedback from healthcare providers is critical for catching unexpected problems that may arise even after a drug is out of clinical trials and in use.
In data science, text data is notoriously more difficult to study than numerical data. Not only is text more variable, filled with nuance, and dependent on context, but the correct answer to a question can vary with the audience answering it. Writing software to analyze text data and produce a reliable and actionable answer when we cannot always even agree on the correct answer is inherently problematic. Nevertheless, understanding high volume textual feedback remains critical for organizations to determine the impact of business decisions and identify blind spots. If healthcare providers are reporting a rare side effect of a particular drug within some context, the sooner the pharmaceutical company producing that drug can address the issue in their research pipeline the sooner they can resolve it. If a survey is soliciting a text response to a question and receiving thousands of answers, having an automated and data-driven solution for identifying common answers and structure within the responses is critical. If customers buying a particular brand of bed sheets are repeatedly writing public product reviews saying that the sheets came out of the package with a horrendous smell, the manufacturers would want to know that without having someone manually read through thousands of reviews before noticing.
Common techniques for analyzing text data include the identification of keywords, supervised classification models, and vector-embedding models (e.g. "Word2Vec" and similar methodologies) that assign vectors to terms and thereby facilitate algebraic manipulation of vector representations of text data. While these can elicit some useful information about the lexicon and topics in a text dataset, none of them truly address the question everyone really wants answered:
"What are the most common ideas being expressed in this data?"
We have been developing a methodology to address this question directly. One critical component of such a process is to start with a text data input of multiple statements that communicate roughly the same "idea" (to be defined more precisely below) with different exact wording and "condense" them into a single output statement that best represents the whole of the inputs.
For an introductory pedagogical example, if we have the following four simple text reviews of some coffee:
"The coffee tastes excessively bitter and acidic."
"I do not like how bitter the coffee is."
"This coffee is unpleasantly acidic."
"The taste is too acidic."
how do we output a single "answer" such as
"the coffee is unpleasantly acidic"
which best represents the idea being expressed in the inputs?
Using the existing "natural language processing" (NLP) techniques of "part-of-speech" tagging (POS tagging) and "dependency parsing", we introduce a technique that we denote "Idea Condensation" to perform this functionality. Multiple similar ideas expressed in text are "condensed" into a single expression that best represents the whole of the inputs.
What is an "idea"?
The field of NLP involves the study of language data with mathematical and statistical techniques; if we can translate a language dataset into a numerical form, then we can use existing quantitative analysis techniques to describe that data. Before we can address what an "idea" is for our purposes in this study, we need to address some common techniques in NLP that allow us to detect structure in text data.
n-grams
A "term" in English NLP is typically a single word ("the", "cat", "runs", "away", etc.), but can also refer to compound words like "living room", "full moon", "Boston Globe" (a newspaper), or even the term "natural language processing" (a three-word compound word) itself. "Boston Globe", for example, is a singular noun that refers to a newspaper and so a single "term" apart from "Boston" (a city) and "Globe" (a representation of the Earth found in classrooms). Identified terms are also often referred to as "n-grams" in NLP, where "n" refers to the number of words comprising the term. "Cat" is a term comprised of one word and so can be denoted a "unigram"; "Boston Globe" is comprised of two words and so can be denoted a "bigram"; similarly, "natural language processing" is a "trigram" (we rarely, if ever, consider n > 3 terms).
Tokenization
Next, "tokenization" in NLP is, in its most general form, the process of taking a sequence of textual characters and partitioning them into a sequence of pieces, where each piece itself is denoted a "token". "Term tokenization" is then the more specific case of taking a sequence of text data and breaking it into a sequence of individual terms by our definition above. Bigrams and trigrams are often identified by counting how frequently the words comprising them occur together and setting an arbitrary threshold. For example, suppose we are considering a document in which the terms "Boston" and "Globe" each occur five times, but every time they occur it is in the sequence "Boston Globe". Then we would likely want to consider that a bigram for the newspaper "Boston Globe" and not references to the city or the spherical map.
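This counting idea can be sketched in a few lines of Python. This is an illustration only, not spaCy's actual collocation machinery, and the `min_count` and `threshold` cut-offs are the arbitrary thresholds mentioned above:

```python
from collections import Counter

def detect_bigrams(tokens, min_count=2, threshold=1.0):
    """Flag adjacent word pairs as bigrams when the words (almost) always
    occur together. `min_count` and `threshold` are arbitrary cut-offs;
    this helper is illustrative, not spaCy's API."""
    unigrams = Counter(tokens)
    pairs = Counter(zip(tokens, tokens[1:]))
    bigrams = set()
    for (w1, w2), n in pairs.items():
        # e.g. "Boston" and "Globe" always occurring together
        frequent_enough = n >= min_count
        dominant = (n / unigrams[w1] >= threshold
                    and n / unigrams[w2] >= threshold)
        if frequent_enough and dominant:
            bigrams.add((w1, w2))
    return bigrams

tokens = ["the", "cat", "reads", "the", "Boston", "Globe", "and",
          "the", "Boston", "Globe", "prints", "news"]
detect_bigrams(tokens)  # {("Boston", "Globe")}
```

Here "the Boston" is rejected because "the" also occurs elsewhere, while "Boston" and "Globe" always occur together and so are merged into one bigram term.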
Part of Speech (POS) Tagging
With term tokenization, we can translate the string of characters
"The cat runs through the Boston Globe."
into a one-dimensional array of terms
["the", "cat", "runs", "through", "the", "Boston Globe"]
Now we can tokenize some text data into a sequence of terms. Part-of-speech tagging (POS tagging) is the technique of assigning labels to individual terms that describe their function in a sentence. In the sentence that we just tokenized, "cat" would be tagged and labeled as a "noun", "runs" is tagged as a "verb", and so forth. So, our term tokenized sequence of text data might then look something like
[("the", "determiner"), ("cat", "noun"), ("runs", "verb"), ("through", "preposition"), ("the", "determiner"), ("Boston Globe", "noun")]
In practice, a noun would be tagged with a shorthand like "NN" instead of "noun", a verb would be tagged "VBZ" instead of "verb", and so forth, but we need not delve into that here. We will be using the spaCy POS tagger in this study.
Dependency Parsing
With text data tokenized and POS tagged, we turn next to the topic of how terms in text data relate to each other. The term "runs" is an action that is being performed by a noun "cat". Since the "cat" (noun) is performing the action "runs" (verb), the "cat" is the "subject" of the verb "runs". The verb "runs" depends on the noun "cat" to make sense. Verbs not only have subjects performing actions, they also have "objects" on which that action is performed.
Note that "the cat runs" seems like it just barely makes the cut for an idea because it is reflexive – both the subject and object of the verb are "cat". The fully formed thought is "the cat runs [itself] through the Boston Globe".
The verb "runs" also takes place somewhere in this idea: the running is happening "through" (a preposition) the "Boston Globe" (the object of the preposition).
Note that "prepositions" and "postpositions" are collectively denoted "adpositions", and so the preposition "through" will be assigned the POS tag "ADP" below. Because the object "Boston Globe" follows the adposition "through", "through" is more specifically a "preposition". If the object of an adposition instead precedes it, the adposition is a "postposition".
All the terms of this sentence form an idea because there is an inherent relationship structure among them. In computational linguistics, this set of relationships is called a "dependency tree". Here’s an example of a visualization of our sentence with the dependency relationships illustrated

The process of identifying these relationships is called "dependency parsing". We will be using the dependency parser of the spaCy Python package.
Defining an "Idea" for NLP
With these techniques, we can address the problem we’re trying to solve. When a client comes to us with a set of text data and asks us "what are the main ideas that the responders are telling us?", we must first ask ourselves "what is an ‘idea’ in NLP?". We could not find any standard definition for an "idea" from an NLP perspective, so we’re proposing one.
Intuitively, an "idea" needs to be more than just any single term. "Cat" is just a noun and "runs" is an action or verb; neither fits an intuitive notion of a fully formed idea. "Cat runs" seems almost like an idea, and "the cat runs" seems like the bare minimum of a singular, fully formed thought. Pairs of words, like "white dog" (an adjective and the noun it describes, absent any action) or "eats food" (a verb and the object noun the verb is acting on, absent the subject noun performing the verb), also do not quite make the cut for a fully formed idea.
Compound sentences, however, intuitively seem to go too far. Consider
"The cat runs through the Boston Globe and then he hides in a box."
This sentence is still fairly short, but contains two fully formed ideas: one idea of an animal running through a place and another of the same animal hiding in a more specific place. Sentences, in general, are too long to represent singular ideas.
So, to represent a singular idea, we’re looking for something more than individual terms or even pairs of terms, but something less than a general full sentence that frequently contains more than one idea. We settled on representing an "idea" as an English language clause, whether dependent or independent, for our purposes in NLP. "Clauses are units of grammar that contain a predicate (verb) and a subject (noun)."
Thus, with our term tokenizer, POS tagger, and dependency parser we identify singular ideas in text data by starting with a single verb and its subject and object nouns. Those nouns and verbs may also then have adjectives or adverbs describing or modifying them and so should be included. Further, adpositions that relate to verbs, such as in our example above, should also be included. An "idea" will be defined by a sequence of terms matching this pattern using the POS tagging and dependency parsing just described.
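As a minimal sketch of this clause test, suppose each token is encoded as an (index, text, POS, dependency label, head index) tuple (in practice, spaCy's tagger and parser supply these fields; the dependency labels follow the common "nsubj", "amod", etc. convention):

```python
# Illustrative clause check: a span counts as an "idea" if some verb in
# it has a subject noun, per the definition above.
def is_idea(parse):
    verb_indices = {i for i, _, pos, _, _ in parse if pos == "VERB"}
    return any(dep == "nsubj" and head in verb_indices
               for _, _, _, dep, head in parse)

# Hand-built parses; spaCy would produce these automatically.
clause = [
    (0, "the", "DET", "det", 1),
    (1, "cat", "NOUN", "nsubj", 2),
    (2, "runs", "VERB", "ROOT", 2),
]
pair = [
    (0, "white", "ADJ", "amod", 1),
    (1, "dog", "NOUN", "ROOT", 1),
]
is_idea(clause)  # True: "the cat runs" is a bare but complete idea
is_idea(pair)    # False: "white dog" lacks a verb and subject
```

This captures the distinction drawn above: "the cat runs" qualifies as an idea, while "white dog" does not.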
The difficulty with managing and studying text data is the inherent ambiguity. A particular idea can be expressed in countless different ways, with slight variations from one expression to the next. With our definition of an idea for NLP purposes, we address in this essay a narrow application. If we have as input a series of idea-sized strings of text data that all express a similar idea with variations in the particular wording – and putting aside the issue of how that series of text was obtained, for the moment – then how can we output a single idea-sized string of text that best represents the whole of the input?
We will show a methodology to address this problem that utilizes only POS tagging, dependency parsing, and count statistics of their results. Note some principles guiding our approach:
- All-or-nothing approaches should be avoided. If a full, simple-sentence description cannot be found, then lesser grammatical constructions that are identifiable can be returned, even if they don’t quite fully match our NLP definition of an "idea". There is a ranking order from highest quality acceptable output to lowest quality acceptable output.
- If there are ties for the highest ranked quality output, all results are returned.
- Text data is noisy so any methodology attempting to solve this problem needs to be resilient to noise in the input.
- Two forms of output are returned:
  - The highest quality grammatical construction that can be automatically identified from the input
  - The single input string that best represents the full series of inputs.
The methodology begins by POS-tagging the input. Then for each tagged part of speech, the most common term is identified. That is, the most common noun, the most common verb, etc., is identified for the input.
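This first step can be sketched in a few lines of Python. The inputs here are hand-tagged toy data for brevity; in practice the (term, POS) pairs would come from the spaCy POS tagger:

```python
from collections import Counter

# Hand-tagged inputs for illustration; spaCy would supply the tags.
tagged_inputs = [
    [("The", "DET"), ("cat", "NOUN"), ("runs", "VERB")],
    [("The", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("A", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
]

def most_common_terms(tagged_inputs):
    """Return the most common term for each part of speech.
    Ties are broken arbitrarily in this sketch."""
    counts = {}  # part of speech -> Counter of terms
    for statement in tagged_inputs:
        for term, pos in statement:
            counts.setdefault(pos, Counter())[term.lower()] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}

most_common_terms(tagged_inputs)
# {"DET": "the", "NOUN": "cat", "VERB": "runs"}
```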
Initial Example
Let’s introduce our idea condensation methodology with a simple, contrived example. Suppose that our input consists of the following four simple sentences. All four sentences say something roughly similar, but with slight variations in wording and meaning.
"The red fox jumps over the lazy dog."
"The red fox jumps."
"The fox jumps."
"The fox jumps over the dog."
POS Tagging in Idea Condensation
Our POS tagging detects the following counts

We note from the POS tags the most common term for each part of speech

Dependency Parsing in Idea Condensation
Now that we have the most common terms for a few different parts of speech, let’s see if we can automatically detect any relationships among them with dependency parsing. In general, such relationships are not guaranteed to be found. If the inputs consist of entirely different statements with different terms, then it is less likely that dependency parsing will find common relationships among the most common terms for each identified part of speech. The more similar the input statements are, however, the more likely we are to find common relationships among the most common terms for the different parts of speech, allowing us to automatically construct more complex idea-sized statements about the input as a whole. We therefore need to be clear that this idea condensation methodology is predicated on the assumption that work has already been done to ensure that the inputs are similar statements. Idea condensation will be one component of a larger methodology for summarizing the ideas in a text dataset.
Let’s look at a graphical representation of the dependency trees for each of the four inputs of this first example.
"The red fox jumps over the lazy dog."

"The red fox jumps."

"The fox jumps."

"The fox jumps over the dog."

Dependencies that point towards a term of interest are in the "head" of that term. Dependencies that point away from our term of interest are in the "children" of the term. Starting with the most common noun, "fox", the most common dependency that is pointing to "fox" is the verb "jumps". So, the term "jumps" is the most common term in the "head" of the most common noun "fox", but "jumps" just happens to also be the most common verb. Therefore, we have determined one relationship between terms that describes the whole of the input. "fox jumps" is a construction we can automatically make by virtue of the facts that (1) "fox" is the most common noun of the input; (2) "jumps" is the most common verb of the input; and (3) the most common verb "jumps" is the most common term in the head of the most common noun "fox".
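The head-counting step just described can be sketched as follows. The (term, head) pairs below are hand-encoded from the four dependency trees; in practice spaCy's `token.head` attribute would supply the head of each token:

```python
from collections import Counter

# Hand-encoded (term, head-term) pairs for the four fox inputs.
heads_per_input = [
    [("the", "fox"), ("red", "fox"), ("fox", "jumps"), ("over", "jumps"),
     ("the", "dog"), ("lazy", "dog"), ("dog", "over")],
    [("the", "fox"), ("red", "fox"), ("fox", "jumps")],
    [("the", "fox"), ("fox", "jumps")],
    [("the", "fox"), ("fox", "jumps"), ("over", "jumps"),
     ("the", "dog"), ("dog", "over")],
]

def most_common_head(term, heads_per_input):
    """Count the heads of `term` across all inputs and return the winner."""
    counts = Counter(head for parse in heads_per_input
                     for t, head in parse if t == term)
    return counts.most_common(1)[0][0]

most_common_head("fox", heads_per_input)  # "jumps"
```

Since "jumps" is both the most common head of "fox" and the most common verb overall, the relationship "fox jumps" describes the whole of the input.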
Let’s continue by looking at the children of the most common noun "fox". The determiner "the" appears in the children of "fox" twice and the adjective "red" also appears in the children of "fox" twice. The term "red" also just happens to be the most common adjective. Therefore, "red" describing "fox" should also be considered descriptive of the input as a whole and we can construct "red fox" or "fox is red".
Finally, putting the two results together, since the most common noun "fox" has the most common verb "jumps" in its head and the most common adjective "red" in its children, we could automatically construct
"the red fox jumps"
as a single idea-sized statement that best represents the whole of the four input statements. In other words, we have "condensed" four ideas into a single idea representing the input data. (The fact that the output idea just happens to exactly match one of the input statements is coincidental).
In our idea condensation methodology, we do not want to rely solely on automatic constructions. We also want to identify which of the particular input statements best represents the whole of the input and then return that example as well.
At the moment, we achieve this by returning the individual input statement(s) that contain the highest number of the most common terms for the different parts of speech. In this example, if we are considering only the most common noun ("fox"), verb ("jumps"), adjective ("red"), and adverb (none identified), then we want to identify which (if any) inputs contain all of "fox", "jumps", and "red". Among the four inputs, two actually contain all three of these terms
"The red fox jumps over the lazy dog."
"The red fox jumps."
Since there’s a tie between these two, both are returned.
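This selection step can be sketched as follows. The substring matching here is a crude stand-in for proper term tokenization, used only for illustration:

```python
inputs = [
    "The red fox jumps over the lazy dog.",
    "The red fox jumps.",
    "The fox jumps.",
    "The fox jumps over the dog.",
]
top_terms = {"fox", "jumps", "red"}  # most common noun, verb, adjective

def best_representatives(inputs, top_terms):
    """Return all inputs (ties included) containing the highest number
    of the most common terms. Substring matching is a simplification;
    real term tokenization would be used in practice."""
    scores = [sum(t in s.lower() for t in top_terms) for s in inputs]
    best = max(scores)
    return [s for s, score in zip(inputs, scores) if score == best]

best_representatives(inputs, top_terms)
# ["The red fox jumps over the lazy dog.", "The red fox jumps."]
```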
The final output for this example is then,
Best constructed answer:
"the red fox jumps"
Best Representative Example From the Input:
"The red fox jumps over the lazy dog."
"The red fox jumps."
A Second Example
Let’s look at one more constructed example with a small number of input statements before incrementing towards more realistic examples.
"The angry baboon viciously bites the poacher."
"The baboon is angry and really fighting hard."
"Now the baboon is running away."
This example has a greater number of parts of speech since there are adverbs and the statements are a little less similar to each other. We mentioned above that one of the principles guiding the development of our methodology is that we do not want an all-or-nothing approach. These three statements all mention a "baboon", but different actions are happening in relation to it in each of the three statements. We therefore may not be able to construct a full idea-sized statement from the inputs, but we also do not want to return nothing. The best representation of the whole of the input that can be identified should be returned. Subjectively, we can see that a description of the baboon is common to all of the inputs, even if the related actions are different. So, our methodology should be able to return at least something about the common elements of the input.
Let’s see what our methodology can do. Our POS tagging identifies

The most common terms for the different parts of speech are

Now let’s look at the dependency parsing for each of the three inputs.
"The angry baboon viciously bites the poacher."

"The baboon is angry and really fighting hard."

"Now the baboon is running away."

Starting again with the most common noun "baboon", there is a three-way tie for the most common term in the head of "baboon": "bites", "is", and "running". We should note that while "is" is the most common verb, only one use of "is" complements "baboon" (2nd input) while the other complements "running" (3rd input), which is why "is" is in a three-way tie with "bites" and "running".
Among the three terms in this tie, "is" is also the most common verb and so we can relate "baboon" to "is" to construct "The baboon is".
The most common adjective is "angry", and the most common terms in the head of "angry" are "is" (the most common verb) and "baboon" (the most common noun). So, we can also connect the adjective "angry" to "baboon".
Our best constructed output is then
"The baboon is angry".
It encompasses the parts of the statements that are common while leaving out the extraneous parts of the inputs that are not repeated.
For the best representative input, there is only one that contains the most common noun, the most common verb, the most common adjective, and one of the most common adverbs:
"the baboon is angry and really fighting hard."
The final output for this example is then,
Best constructed answer:
"The baboon is angry".
Best Representative Example From the Input:
"the baboon is angry and really fighting hard."
A Bitter Coffee to Swallow
In this third example, we would like to illustrate how this methodology is robust against noise in the input and go further to show how the constructed output is not limited to forming a single idea-sized text statement. Consider an input dataset comprised of the following statements about coffee with a couple unrelated statements mixed in.
"The coffee tastes excessively bitter and acidic."
"I do not like how bitter the coffee is."
"This coffee is unpleasantly acidic."
"The taste is too acidic."
"I am uselessly commenting some other unrelated statement."
"This is another noise comment."
The POS tagging and dependency parsing find

Our automated output construction actually looks for a wide variety of possible grammatical constructions and outputs all of them with a priority ordering. The fullest and most complete "ideas" are given the highest priority, while simpler constructions, like merely a noun and an adjective describing it, are given lower priority.
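A schematic sketch of this priority ordering follows. The templates here are illustrative, not the system's actual construction rules, and they assume the dependency relationships among the most common terms have already been verified as in the earlier examples:

```python
def construct(top):
    """Try the fullest construction first, falling back to simpler ones.
    `top` maps part of speech -> most common term (absent if none found).
    Templates are schematic placeholders for illustration."""
    noun, verb, adj = top.get("NOUN"), top.get("VERB"), top.get("ADJ")
    if noun and verb and adj:
        return f"the {adj} {noun} {verb}"  # fullest idea-sized output
    if noun and verb:
        return f"the {noun} {verb}"        # clause without modifiers
    if noun and adj:
        return f"{adj} {noun}"             # noun + adjective only
    return noun or verb or adj             # lowest acceptable output

construct({"NOUN": "fox", "VERB": "jumps", "ADJ": "red"})  # "the red fox jumps"
construct({"NOUN": "baboon", "ADJ": "angry"})              # "angry baboon"
```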
For this input, the full constructed output best describing the whole of the input is:

Meanwhile, the one statement in the input that is detected to best describe the whole of the inputs is:
"this coffee is unpleasantly acidic."
The fact that the two forms of output agree so closely, even when extraneous noise is present in the input, speaks to the power and effectiveness of the idea condensation methodology we are presenting.
A Breath of Fresh Coffee
In our fourth example, we constructed a dataset comprised of a baker’s dozen sentences copy-and-pasted from reviews of an actual coffee product on Amazon. Although we did not write these statements directly for pedagogical purposes like we did in the first three examples, we did hand pick single-sentence statements mostly related to how the coffee smelled. In practice, people often do not even write in complete sentences or with close to correct grammar, so having a methodology that is resilient to less-than-ideal conditions is critical.
"This coffee has a really pleasant aroma."
"This one really smelled great."
"Great smell and taste!"
"I picked up on some great aromatics when I first bought this."
"Morning brews have a lighter scent."
"It really emanates an evanescent smell that fills my house and company loves."
"I love waking up to the smell of this coffee."
"Freshness and that wonderful aroma is great!"
"Love this coffee!"
"Great bold smoky, almost chocolaty taste with sweet aromas and no sourness."
"The crema is magnificent. It has a velvety flavor and nice aroma."
"The aroma, flavor, and smoothness is unmatched."
"Best espresso beans ever!"
The POS tagging and dependency parsing find

The constructed output from idea condensation is

The particular inputs that best represent the whole of the input dataset are (there was a tie for first place):
"I picked up on some great aromatics when I first bought this."
"Freshness and that wonderful aroma is great!"
Final Example
In our final example, let’s consider an input that is not as contrived. Although we have explicitly addressed the assumptions under which idea condensation is applicable, we have not yet addressed the broader context of how and when we might actually use it in practice.
There are a myriad of situations where we find ourselves with a dataset comprised of many paragraph-length text responses written by people in response to some common topic. They could be answers to a particular survey question or reviews of a particular product or service. The question everyone wants to know the answer to is "What are the most common ideas people are expressing to us in this dataset?"
Keywords and classification models do not answer this question. As such, we attempted to answer the question as stated and searched for a solution that yields the "ideas" (as defined above in this essay) inherent to the structure of the input dataset, assuming we do not know anything about the answers and want to find that which we cannot predict. That’s why we started with an unsupervised training approach. Unsupervised training techniques are designed to identify the structure inherent to a dataset. When applied to text data, an unsupervised training approach can identify clusters within the dataset that have similar wording.
Coffee and Bed Sheets
In that spirit, we constructed a dataset of 500 paragraph-length user reviews copied from Amazon. 250 of the reviews were related to a particular brand of coffee and 250 were related to a particular brand of bed sheets. 70% of the coffee reviews were positive in sentiment and related to the smell of the coffee while the other 30% were negative sentiment reviews about the acidity/bitterness of the coffee. 70% of the bed sheet reviews were positive sentiment reviews about the sheets being comfortable while the other 30% were negative sentiment reviews about the sheets having a terrible smell straight out of the packaging.
The first step was to tokenize that dataset into clauses, since we decided above to use a clause as the grammatical equivalent of a singular "idea" in text. Next, we used a Term Frequency-Inverse Document Frequency (TF-IDF) model to create vector representations of each clause. With those vector representations, we could then perform a K-means clustering on the dataset. As part of that K-means clustering process, we used our Factionalization Methodology to first estimate the optimal number of clusters inherent to the dataset, then ran the K-means clustering assuming that automatically-detected optimal number of clusters. With the dataset partitioned into those K clusters, we can treat the text corresponding to each cluster as the input for idea condensation.
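Assuming scikit-learn is available, the clustering stage can be sketched as follows. The number of clusters is fixed here for illustration rather than estimated by our Factionalization Methodology, and the toy clauses stand in for the real tokenized dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy stand-ins for the clause-tokenized dataset.
clauses = [
    "the coffee smells great",
    "great coffee smell and aroma",
    "these sheets are soft and comfortable",
    "very comfortable soft sheets",
]

# TF-IDF vector representations of each clause.
vectors = TfidfVectorizer().fit_transform(clauses)

# K-means clustering (k fixed at 2 here for illustration).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# Group clauses by cluster; each group becomes an input to idea condensation.
clusters = {}
for clause, label in zip(clauses, labels):
    clusters.setdefault(label, []).append(clause)
```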
We have described the full "idea summarization" process in a separate essay.
Weird Smelling Bed Sheets Straight out of the Packaging
This input for idea condensation was automatically generated from our coffee and bed sheets development dataset – it is the text of one of the clusters generated through an automated K-means clustering process.
"i love the smell first thing in the morning"
"smell"
"horribly overburned taste and smell"
"smell and taste good"
"it is light brown in color and you can smell the acid as soon as you open the bag"
"my wife said smell good"
"smell great"
"they smell weird"
"they smell a bit inky after coming out of the package"
"also they had a slight 'new smell' kind of a bad smell"
"strong chemical smell"
"they feel weird and smell a bit like plastic when you take them out of the package"
"they smell weird"
"great smell and taste"
"strong chemical smell"
"after owning them for less than a year they have developed a musty smell"
"weird chemical smell made us choke"
"i live in southwest florida they came with really weird smell"
"it smelled like cat pee when it arrived and i had to wash it 3 times to remove the awful smell"
"the smell was still"
"there was a strong chemical smell when the package was opened (took washing them 3 times to get smell out)"
"they smell like they've been in my grandma's linen closet for 20 years"
"musty smell"
"the smell though"
"they smell terrible"
"with fabric softener and unstoppables and can not get the smell out of them"
"they smell like a combination of antifreeze and mothballs"
"i noticed quite a bit of blotchy discoloration and a strange burnt smell"
"strong chemical smell"
"horrible chemical smell even after washing"
"the terrible chemical smell is still present"
"i'm going to wash them and see if the smell goes away"
"both had this weird smell that took several washes to get rid off"
"they smell like elmer's glue"
"they smell like something rotten"
"muted chemical smell (like an garage at a gas station)"
"they smell terrible"
"when i opened it they smell like perfume/scented detergent and are stained"
"they started to smell"
"rating is 3 stars due to the chemical smell after 2 washes in the past 24 hours after receiving"
"i can't get the burnt smell out of my dryer and almost burned my condo down"
"terrible petroleum smell out of the package"
"fishy smell from the material even after multiple washings"
"hard to wash off the "new" funky smell"
"these smell like paint"
"the chemical smell was super strong when i took them out of the package"
"they smell heavily like chemicals"
"they smell bad (like petrol)"
"very weird smell"
"they still smell strongly of formaldehyde"
"they came with a really weird smell"
"when i opened i felt a strong smell and after wash the smell didn't away completely"
"don't really ever smell clean"
"weird smell"
"i'm not sure what's going on with the smell"
"these have a terrible smell right out of packaging"
"the smell was awful"
"they smell horrible"
"they smell like you just painted a room and it needs to dry"
"i've washed them multiple times to try and get the smell out"
"it's definitely an overwhelming smell"
"the smell really makes my mouth water like pavlov's dog just before my first sip"
"it really emanates an evanescent smell that fills my house and company loves"
The POS tagging and dependency parsing found

The constructed output from idea condensation is

The particular inputs that best represent the whole of the input dataset are (there was a tie for first place):
"I live in southwest florida they came with really weird smell."
"They came with a really weird smell."
Really Comfortable Bed Sheets
Let’s try another cluster of ideas identified from the same dataset and from the same execution of K-means clustering. The clustered idea-sized text and therefore the input to idea condensation is:
"these sheets are soft & comfortable to sleep on."
"these are honestly the most comfortable sheets i've ever owned."
"the sheets are pretty soft after a few washes and are very comfortable."
"these sheets are lightweight which makes them very comfortable for me."
"both the pillowcases and sheets are soft and very comfortable."
"these sheets are the most comfortable that i've ever slept in."
"the sheets are very comfortable."
"they are comfortable sheets that can withstand summer heat."
"we asked our guest what they thought of the sheets and they said they were very comfortable and soft."
"the sheets and pillow cases are comfortable and i like them a lot."
"now i have these and they are by far the most comfortable sheets i have had in a while."
"these are the sixth set of amazon basics bed sheets i've purchased and they are very comfortable."
"these sheets are soft and comfortable without becoming oppressively hot."
"these sheets are soft and comfortable."
"i was very pleasantly surprised at how comfortable and soft these sheets are."
"these are among the most comfortable sheets i've ever slept on."
"however the feel of these sheets even after washing a few times is definitely comfortable."
"bought these for my guest room for my sister when she is over and they are quite possibly the most comfortable sheets ever."
"the sheets remain comfortable and smooth."
"comfortable sheets in great colors."
"comfortable sheets that will keep you cool at night."
"these sheets are better than the 400 thread count sheets that i have gotten in the past in an effort to be more comfortable sleeping."
"these sheets are extremely soft and comfortable."
"these sheets are soft and comfortable."
"super soft and light weight microfiber sheets are the most comfortable i have ever slept on."
"if you're looking for comfortable high quality sheets at an affordable price then these are the sheets for you."
"the sheets got softer after the first wash and are very comfortable to sleep on."
"i get really warm when i sleep and these sheets are really lightweight which is comfortable for me."
"very soft and comfortable sheets."
"soft and very comfortable i have gone through a bimbos different sets of sheets and these by far are the ones i love."
"the sheets are super comfortable."
"they're actually very soft sheets and comfortable to sleep on."
"i got these sheets with minimal expectations and was blown away by how comfortable they are."
"these sheets are comfortable and soft."
"these are the most comfortable sheets or fabric for that matter that you will ever experience."
"comfortable sheets."
The POS tagging and dependency parsing found:

The constructed output from idea condensation is:

The particular inputs that best represent the whole of the input dataset are (there was a 7-way tie for first place):
"The sheets are pretty soft after a few washes and are very comfortable."
"These sheets are lightweight which makes them very comfortable for me."
"Both the pillowcases and sheets are soft and very comfortable."
"The sheets are very comfortable."
"These are the sixth set of amazon basics bed sheets i’ve purchased and they are very comfortable."
"I was very pleasantly surprised at how comfortable and soft these sheets are."
"The sheets got softer after the first wash and are very comfortable to sleep on."
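The article does not pin down how these best-representative inputs are chosen; one natural reading is a nearest-to-centroid selection. The sketch below assumes a simple bag-of-words cosine similarity (a production system would more likely use learned embeddings), and `most_representative` is a hypothetical helper name, not part of the described system:

```python
import math
import re
from collections import Counter

def bow(text):
    """Bag-of-words count vector for a short piece of text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_representative(texts):
    # Summing the counts gives the centroid direction; cosine similarity
    # is scale-invariant, so dividing by len(texts) is unnecessary.
    centroid = Counter()
    for t in texts:
        centroid.update(bow(t))
    return max(texts, key=lambda t: cosine(bow(t), centroid))
```

Under this scheme, a tie for first place arises whenever several inputs sit at the same cosine distance from the centroid.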
Smooth Coffee Beans
While we will not present the idea condensation results for every cluster found in our coffee and bed sheets dataset, let's conclude with one more example to illustrate that idea condensation is not a one-hit wonder. All the results presented in this final example correspond to a single execution of our automated text clustering algorithms.
Here is the text data from another cluster, one that appears to have centered around the term "smooth" in reference to coffee. This cluster is not as highly populated as the previous two, so there is a 16-way tie for the most common noun, with each appearing just once. Still, a sensible representation is found despite the relatively noisier text data.
"it's very smooth."
"it is very smooth and his tummy tolerates more than one cup a day."
"and for me personally really went against what i'm looking for in an espresso - smooth."
"high acidity and not smooth at all."
"very acidic and not smooth."
"it's smooth."
"it is a well balanced bean that finishes smooth."
"creamy) and extremely smooth."
"smooth."
"they're smooth."
"it is very smooth."
"they're smooth."
"than the smooth full body i am accustomed to with medium and dark roast guatamalan."
"get silky smooth and cleaned up."
"smooth texture with good amount of oils."
"silky smooth."
"it's very smooth and flavorful."
"very smooth."
"smooth mouth-feel."
"smooth."
The POS tagging and dependency parsing found:

The constructed output from idea condensation is:

The particular input that best represents the whole of the input dataset is:
"it is very smooth and his tummy tolerates more than one cup a day."
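The noun tallies referenced throughout these examples (such as the 16-way tie for the most common noun) come from POS tagging. As a library-free illustration, the sketch below counts content-word frequencies using a hand-rolled stopword list; this stands in for the real noun filter that a POS tagger would provide, and `content_word_counts` is a hypothetical name:

```python
import re
from collections import Counter

# Stopword list for illustration only; the real pipeline instead keeps the
# tokens that POS tagging labels as nouns rather than filtering stopwords.
STOPWORDS = {"the", "a", "is", "are", "and", "it", "it's", "they're",
             "very", "not", "at", "all", "with"}

def content_word_counts(texts):
    """Tally content-word frequencies across one cluster of clause-length texts."""
    counts = Counter()
    for text in texts:
        counts.update(
            w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS
        )
    return counts
```

Applied to a sparse cluster like the one above, nearly every tallied word appears once, which is exactly how a many-way tie for first place arises.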
Hence, we were able to start with a dataset of real user reviews – albeit one in which we inserted a handful of ideas that we knew we wanted our modeling methodology to reproduce – and automatically generate an idea-sized piece of text that accurately reflects a particular cluster within that dataset. No part of our methodology knows anything about coffee, bed sheets, the aroma of anything, or any other content particular to a specific dataset. Still, we are now able to accurately identify the most common ideas expressed in a large dataset of paragraph-length text and output them as a short, digestible set of bullet points and examples – and do so in a fully automated manner.
Conclusion
We have demonstrated a technique that we developed and denote "idea condensation". This technique is valid under the assumption that the input consists of a series of clause-length pieces of text that express roughly similar ideas. With the proposed methodology, an automated clause-length text output can be generated that best represents the idea being expressed in the input data. Additionally, the particular entry from the series of input data that best represents the whole of the input is identified and returned. In a sense, we are "condensing" the "ideas" expressed in the input into a single expression that best summarizes it. The methodology is completely agnostic to the actual content of the input data, and neither model training nor the selection of a classification scheme is necessary.
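As a rough illustration of the condensation step, the sketch below pairs a cluster's most frequent content word with the word that most often directly precedes it. This adjacency heuristic is only a stand-in for the POS tags and dependency arcs the methodology actually relies on, and `condense` is a hypothetical name:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "is", "are", "and", "it", "they", "very", "i"}

def condense(texts):
    """Pair the cluster's most frequent content word with its most frequent
    immediate predecessor, yielding a phrase like "smooth coffee"."""
    tokenized = [re.findall(r"[a-z']+", t.lower()) for t in texts]
    words = Counter(w for toks in tokenized for w in toks if w not in STOPWORDS)
    head = words.most_common(1)[0][0]
    # Words that most often appear directly before the head term act as a
    # crude substitute for dependency-parsed modifiers.
    modifiers = Counter(
        toks[i - 1]
        for toks in tokenized
        for i, w in enumerate(toks)
        if w == head and i > 0 and toks[i - 1] not in STOPWORDS
    )
    return f"{modifiers.most_common(1)[0][0]} {head}" if modifiers else head
```

Because every step is a frequency tally over the input itself, the construction stays agnostic to the subject matter, mirroring the content-agnostic property described above.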
Idea condensation was developed as one component of a larger text processing methodology that we have developed and denote "idea summarization". The premise of idea summarization is that we start with an arbitrary text dataset comprising a high volume of entries that are very roughly paragraph length, as opposed to inputting a single work like a novel and summarizing it (which idea summarization is not designed to handle). Idea summarization identifies clusters of ideas within the dataset, then uses idea condensation to output a short list of the ideas being repeated in the original text dataset along with a representative selection of examples. If necessary, a deeper dive into the original text data associated with a particular identified idea is also possible. By developing this methodology, we empower organizations to interpret large text datasets more efficiently and identify actionable intelligence.
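A toy end-to-end sketch of this cluster-then-represent flow is given below. Keyword grouping stands in for the real clustering stage, and longest-member selection stands in for centroid-based representative selection; both are assumptions for illustration, as are the names `keyword` and `summarize`:

```python
import re
from collections import Counter, defaultdict

STOPWORDS = {"the", "a", "is", "are", "and", "it", "they", "very", "i"}

def keyword(text):
    """Stand-in for real clustering: a text's most frequent content word."""
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    return Counter(words).most_common(1)[0][0] if words else ""

def summarize(texts):
    """Group texts by keyword, then pick one representative per group."""
    clusters = defaultdict(list)
    for text in texts:
        clusters[keyword(text)].append(text)
    # Longest member as representative: a crude proxy for the similarity-based
    # selection that idea condensation performs on each cluster.
    return {k: max(members, key=len) for k, members in clusters.items()}
```

Even in this toy form, the output has the shape described above: a short list of repeated ideas, each paired with a representative example drawn from the original data.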