A Machine Learning Approach to Automated Customer Satisfaction Surveys

by Tyler Doll, Matthew Bussing, Kai Nichols, Sidney Johnson

While going through some of my documents, I came across a paper I was a part of in college that I thought would be perfect to share. For context, this paper was written for one of my classes where we had a project sponsored by a company. In my case, I worked with Matthew Bussing, Kai Nichols, and Sidney Johnson on a project sponsored by ProKarma to try to automate customer satisfaction surveys. This paper was originally published on our course website here. It has been adapted to fit this format and the original is embedded at the end of the article. Enjoy!

INTRODUCTION

Natural language processing (NLP) is a broad discipline concerning the problems of capturing, analyzing, and interpreting human
language. In particular, NLP can be used to evaluate human speech
in order to determine different qualities about its speaker, context,
and sentiment. These evaluations can be used to inform human-computer interaction, accessibility, automation, and statistical analysis of natural language.

This NLP project was proposed by the Edge Intelligence group at
ProKarma, an IT solutions company headquartered in Nebraska offering digital transformation and analytics consultancy. ProKarma’s
Edge Intelligence group is researching ways to improve customer
service by incorporating machine learning to augment customer
service calls in the telecommunications industry. One approach to
improving customer satisfaction is to determine customer attitudes toward particular products by performing sentiment and topic analysis on customer call data, effectively allowing for automated customer service surveys.

Automating customer satisfaction data collection would provide
actionable information about consumer attitudes toward products
and services at much higher response levels than simply relying on
customers staying on the line to finish a phone survey. This automated system could run in parallel with the traditional phone surveys
currently in place at many wireless carriers. Automated sentiment
analysis on products mentioned during the call could be validated
against instances where the customer finished the post-call phone
survey.

Because of the huge number of variables involved in natural language, traditional analytic methods would be prohibitively difficult
to adapt to this purpose. Machine learning algorithms offer both simple linear interpolation of audio data and powerful pattern-finding tools. For this reason, neural networks were developed for
both topic and sentiment classification. These models find patterns
between product types and consumer attitudes, which provides the
information required to generate automated customer satisfaction
surveys from customer call audio.

METHODS

Three models were required to accomplish the ultimate goal of an
automated survey process: speech-to-text transcription, sentiment
modeling, and topic modeling.

  • Speech-to-text transcription: In order to automate transcription of call data, a strategy for automated speech recognition (ASR) was required. An accurate and high-performance implementation of ASR was crucial in order to perform later topic and sentiment analyses, which take plain-language transcripts as input. Transcripts were preferable to pure audio as inputs for two reasons: firstly, topic modeling cannot be performed directly on audio, as topics are strings which must be pulled directly or interpreted from within a text source; secondly, audio files can be useful in detecting vocalized emotion but cannot detect unemotional sentiment (such as a monotonous customer saying “thank you for your help” or a bubbly customer happily saying “my phone won’t turn on”).
  • Sentiment model: As sentiment analysis is important for determining customer satisfaction, a machine learning model was required to detect sentiment (positive, negative, or neutral) toward the customer service representative based on transcribed text input. ProKarma’s Edge Intelligence group previously implemented a bi-directional long short-term memory recurrent neural network (LSTM RNN) trained using open-source datasets such as the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database in order to perform emotion and sentiment classification on acoustic features extracted from audio clips. In this project, our transcript-based model was intended to capture overall sentiment from the text itself rather than from vocal features, as the previous acoustic-based model did.
  • Topic Model: To cross-analyze customer sentiment with topics of conversation, classification models were produced to identify concepts and specific subjects in transcripts. Trends in conversational topics can be used to understand both individual customer attitudes toward products and services as well as to understand those of the general customer base.

These models culminated in an engine for end-to-end audio analysis of customer call recordings. All project code was written using
Python 3.6, used Google’s TensorFlow machine learning library via
the Keras interface, and used the Python/R Anaconda distribution.
Version control was performed using git and GitHub was used for
hosting all project repositories. The project was developed using an
Agile workflow, and team collaboration and communication were
facilitated using Slack.

SYSTEM ARCHITECTURE AND DESIGN

The aim of this project was to prove it is possible to take a customer
service call as input and get data relevant for a customer satisfaction survey as output. The system constructed to accomplish this task, pictured in figure 1, consists of three main models: a speech-to-text model, a topic model, and a sentiment model. The speech-to-text model was trained on web-scraped customer service call audio files. The topic model was trained on a web-scraped corpus of a wireless carrier’s FAQs and forum posts. The sentiment model was trained on a combined dataset of annotated IEMOCAP transcripts and electronics reviews. The customer service call audio was then passed through the speech-to-text program to generate transcripts, which were then fed into the topic and sentiment models. These latter two models output predictions of overall topics and sentence-by-sentence sentiment tagging. These two outputs were then aggregated to produce visualizations of customer satisfaction indicators such as sentiment over time during the call and sentiment associated with specific topics.
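
To make the data flow concrete, the sketch below strings the three models together in the order described above. It is illustrative only: the function and parameter names (analyze_call, speech_to_text, topic_model, sentiment_model) are hypothetical stand-ins, not the project's actual interfaces.

```python
# Hypothetical glue code illustrating the figure 1 pipeline. The callables
# passed in (speech_to_text, topic_model, sentiment_model) are stand-ins for
# the project's actual models, not real interfaces from the paper.
def analyze_call(audio_clips, speech_to_text, topic_model, sentiment_model):
    """audio_clips: phrase-length clips cut from one customer service call."""
    # 1. Transcribe each phrase-length clip (DeepSpeech in the actual project).
    sentences = [speech_to_text(clip) for clip in audio_clips]

    # 2. Predict overall call topics from the full transcript (LDA).
    topic_probabilities = topic_model(" ".join(sentences))

    # 3. Tag each sentence as positive / negative / neutral.
    sentiment_by_sentence = [sentiment_model(s) for s in sentences]

    # 4. Aggregate into survey-style indicators: sentiment over time
    #    during the call and sentiment associated with specific topics.
    return {"topics": topic_probabilities,
            "sentiment_by_sentence": sentiment_by_sentence}
```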

Figure 1: Diagram of general system architecture.

TECHNICAL DESIGN

The use of Convolutional Neural Networks (CNNs) for natural language
processing is a fairly recent development. CNNs are best
known for their success in image classification, where filters are
naturally proficient at identifying components of an image such as
lines and shapes. Whereas a CNN used on images can successfully
find patterns in shapes, the filters of a CNN used on language are
analogously able to find patterns in n-grams.

The network used for testing, pictured in figure 2 and elaborated
on in figure 3, has two convolutional layers. The first convolutional
layer has 64 filters with a filter size of 4, and the second has 32 filters
with a filter size of 6. These were starting points to verify that the
model was functional. These parameters could be further tuned
to improve the accuracy of the model. A larger number of filters
and filter sizes in the range of 3–5 are recommended for sentiment
analysis using a CNN in order to capture a large amount of contextual
information based on the relations between closely adjacent
words. The dropout layer in the diagram is intended to prevent
overfitting.
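
As a concrete illustration, here is a minimal Keras sketch of a CNN with the layer sizes just described (64 filters of width 4, then 32 filters of width 6, plus dropout). The vocabulary size, sequence length, embedding dimension, pooling, and dense output layer are assumptions made for the sketch, not values taken from the figures.

```python
# Minimal Keras sketch of the CNN described above: two 1-D convolutional
# layers (64 filters of size 4, then 32 filters of size 6) and a dropout
# layer. Vocabulary size, sequence length, embedding dimension, pooling,
# and the dense output layer are illustrative assumptions.
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE, MAX_LEN, EMBED_DIM = 8000, 50, 128  # placeholder values

model = keras.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM, input_length=MAX_LEN),
    layers.Conv1D(64, 4, activation="relu"),   # n-gram patterns of width 4
    layers.Conv1D(32, 6, activation="relu"),   # n-gram patterns of width 6
    layers.GlobalMaxPooling1D(),
    layers.Dropout(0.5),                       # helps prevent overfitting
    layers.Dense(3, activation="softmax"),     # positive / negative / neutral
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```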

Figure 2: Convolutional neural network architecture.
Figure 3: CNN layer type, output shape, and number of parameters.

Latent Dirichlet Allocation (LDA) topic models are widely used for
identifying underlying topics in a set of documents. Given K topics
to find, an LDA model will sort words in a document into various
topics based on the words they appear most frequently with and
their respective topics. This works through a generative process
where the LDA model assumes the following: a document is composed
of an assortment of topics and each topic is composed of
an assortment of words. It then reverse engineers this process in order
to identify the topics in a corpus. Figure 4 is the graphical plate
notation where:

  • M is the number of documents
  • N is the number of words in a document
  • α is the parameter of the Dirichlet prior on the per-document topic distributions
  • β is the parameter of the Dirichlet prior on the per-topic word distributions
  • θ_m is the topic distribution for document m
  • φ_k is the word distribution for topic k
  • Z_mn is the topic for the nth word in document m
  • W_mn is the specific word
Figure 4: Diagram of smoothed LDA model.

For the purposes of sorting customer service transcript data into
topics, the online technical support discussion board and FAQ section
of a wireless carrier were used. By using this dataset, the topics
could be predetermined and the model could be tuned to identify
specific topics. Figure 5 provides examples of topic information
which can then be used to identify topics. In order to do this,
β was defined in the following way:

  • For each topic, every word was initially given a weight of
    1 / (corpus vocabulary size)
  • Then, words that were predetermined to be more frequent
    for a given topic were heavily biased (1000 times more weight)

This enabled the model to bias toward finding specific topics in a
document rather than determining its own.
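
The gensim sketch below shows one way to implement this seeding; gensim exposes the per-topic word prior (the β above) as `eta`. The documents, keyword lists, and topic count here are placeholders for illustration, not the project's actual data.

```python
# Sketch of seeding the per-topic word prior (gensim's `eta`, the β above):
# every word starts at 1 / vocabulary size and predetermined keywords get
# 1000x the weight. Documents, keywords, and topic count are placeholders.
import numpy as np
from gensim import corpora, models

documents = [["iphone", "screen", "cracked"],
             ["galaxy", "battery", "drains", "fast"]]
seed_words = {0: ["iphone"], 1: ["galaxy"]}   # topic index -> biased keywords

dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

num_topics = 2
vocab_size = len(dictionary)
eta = np.full((num_topics, vocab_size), 1.0 / vocab_size)
for topic, words in seed_words.items():
    for word in words:
        eta[topic, dictionary.token2id[word]] *= 1000  # heavy bias toward seeds

lda = models.LdaModel(corpus, num_topics=num_topics,
                      id2word=dictionary, eta=eta, random_state=0)
print(lda.print_topics())
```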

Figure 5: Example of raw output from the LDA model.

DESIGN DECISIONS

Some tools were used across all sections of the project. The general-purpose
data science tools used throughout the project include:

  • Python: Python is a powerful, general-purpose programming language
    with a robust set of libraries. It is an industry standard
    with wide support networks.
  • TensorFlow: TensorFlow is an open source library for deployment of high
    performance numerical computation across different platforms.
  • Anaconda: The Anaconda distribution contains a plethora of data science
    packages, including conda, which is an OS-independent
    package and virtual environment manager, allowing for easy
    package installation across machines.
  • Jupyter Notebook: Jupyter notebooks are documents which contain text and embedded code which can be easily run within the document.
    They are simple to share and provide high-quality,
    interactive output for our Python code. The prose-like combination
    of interpreter output, markdown, and visualizations in a single
    environment facilitates collaboration on code and concepts.
  • Pandas: Pandas is a powerful python library which is ubiquitous in
    data science. In this project it was used for managing training
    and testing data.
  • Matplotlib: Matplotlib is a python library for data visualization. In this
    project it is used for final visualization of sentiment analysis
    and topic modeling data.
  • Pickle: Pickle is a python library for object serialization and data
    storage. In this project it is useful for exporting model configurations
    and related dataframes.

As this was a far-reaching project with many moving parts, a variety
of technical tools were required to specifically accomplish the
goal of each major utility:

Speech-to-text transcription

  • ffmpeg: Due to the way speech-to-text engines process audio
    files, it is important to use audio encoding that is consistent
    with that expected by the engine. FFmpeg is a free, lightweight
    command-line audio encoder that fits seamlessly into
    our audio processing workflow. The final encoding pattern
    consisted of mono-channel audio with a 16 kHz sampling frequency,
    a 16-bit bit depth, and 300/3000 Hz high/low pass
    filtering to specifically capture voice audio (a minimal
    encoding sketch follows this list).
  • Audacity: Beyond encoding, our data required a small amount of editing, mostly in the form of phrase partitioning. Audacity’s
    “Silence Finder” utility allowed us to break long audio
    clips into phrase-long sound bites and export them easily
    for immediate transcription processing in DeepSpeech.
  • DeepSpeech: DeepSpeech is a free speech-to-text engine with
    a high accuracy ceiling and straightforward transcription
    and training capabilities. DeepSpeech also comes with pretrained
    models that can be refined via transfer learning to
    improve accuracy on any particular type of audio, avoiding
    the need for extended training times or an extremely large
    dataset. Our transfer learning process involved removing
    the final interpretive layer of DeepSpeech, replacing it with
    a randomly-initialized layer, freezing all sublayers and training
    the final layer.
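
As a rough illustration of the encoding step mentioned in the ffmpeg item, the sketch below invokes ffmpeg from Python with flags matching the pattern described (mono, 16 kHz, 16-bit PCM, 300/3000 Hz high/low-pass filtering). The file names are placeholders and the project's exact command line is not documented here.

```python
# Rough sketch of the encoding step: invoke ffmpeg from Python with flags
# matching the pattern above (mono, 16 kHz, 16-bit PCM, 300/3000 Hz
# high/low-pass filtering). File names are placeholders.
import subprocess

def encode_for_asr(src, dst):
    subprocess.run([
        "ffmpeg", "-y", "-i", src,
        "-ac", "1",                  # mono channel
        "-ar", "16000",              # 16 kHz sampling frequency
        "-acodec", "pcm_s16le",      # 16-bit PCM
        "-af", "highpass=f=300,lowpass=f=3000",  # keep the voice band
        dst,
    ], check=True)

# encode_for_asr("call_raw.mp3", "call_16k_mono.wav")
```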

Topic model

  • gensim: gensim is a general topic modeling library for python
    that has many useful features. Gensim was used primarily
    for its LDA model as it is very powerful and easy to train
    and tweak to our needs.
  • NLTK: NLTK stands for Natural Language Toolkit and is a
    Python library used for natural language processing. This library
    was used for our topic model to normalize documents
    before running them through our LDA model (a sample
    normalization pass is sketched after this list).
  • pyLDAvis: pyLDAvis is a python package used for visualizing
    LDA models and was used for this project because it
    interfaces well with gensim. pyLDAvis allowed us to easily
    visualize our LDA model’s performance.
  • Stanford NER: Stanford Named Entity Recognizer model was
    used for identifying named entities in a corpus such as persons,
    locations, and organizations. This helped us identify
    the subject of a document.
  • Stanford POS Tagger: Stanford Part of Speech Tagger was used for identifying the part of speech for each word in a corpus. This helped us identify the subject, verb, and object of call transcript sentences.
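
The sketch below shows one plausible normalization pass of the kind the NLTK item refers to (lowercasing, tokenization, stop-word removal, and stemming); the exact steps used in the project may have differed.

```python
# One plausible normalization pass of the kind described above: lowercase,
# tokenize, drop stop words and non-alphabetic tokens, then stem. The exact
# steps used in the project may have differed.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def normalize(document):
    tokens = word_tokenize(document.lower())
    return [STEMMER.stem(t) for t in tokens
            if t.isalpha() and t not in STOP_WORDS]

print(normalize("My phone's screen stopped responding after the update."))
```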

Sentiment model

  • Word Embedding: Tensorflow_hub’s word embedding modules
    were chosen as input for our neural network classification
    models since they allow us to map sentences to numerical
    values while preserving word-order information. These
    modules are easy to implement through tensorflow_hub
    and come pretrained (see the sketch after this list).
  • Manual Feature Selection (MFS): Manual feature selection
    was used for the SVM and Random Forest classifier models
    because it allowed faster training, and therefore could handle
    a larger number of features. This allowed us to train our
    model to recognize vocabulary specific to the topic at hand.
  • (MFS) NLTK Stemming: In manual feature selection, words
    were stemmed so that words with the same root would be
    recognized as the same word which reduced our feature
    space. NLTK’s Stemming module was chosen for ease of use.
  • (MFS) Filtering: Words longer than 25 characters, words shorter
    than 3 characters, words appearing more than 2000 times,
    and words appearing less than 5 times were all filtered out
    to reduce the feature space.
  • Bigrams: Bigrams were added to the feature space to embed
    some information on word order.
  • Classification Models: A variety of classification algorithms
    were explored to have comparisons for the accuracy of our
    models on the training data. The models explored were an
    SVM classifier, Random Forest classifier, LSTM Recurrent
    Neural Network, and Convolutional Neural Network.
  • Sklearn, Tensorflow, and Keras: Tensorflow and Keras were
    used for neural network training. Sklearn was used for the
    other classification algorithms.
  • NLTK and NLTK-trainer: NLTK was used for sentiment analysis,
    since it has several prebuilt sentiment models. This lets
    us quickly set up a standard model and compare against our
    own model iterations.
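
For illustration, the sketch below feeds tensorflow_hub sentence embeddings into a small Keras classifier. It uses the current tensorflow_hub API (`hub.load`) and an arbitrary dense classifier, whereas the project used the TF1-era API and an LSTM; the module URL, layer sizes, and example sentences are assumptions.

```python
# Illustrative only: tensorflow_hub sentence embeddings feeding a small Keras
# classifier. This uses the current hub.load API and a dense classifier; the
# project used the TF1-era API and an LSTM. Module URL, layer sizes, and
# sentences are assumptions.
import tensorflow as tf
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentences = ["thank you for your help", "my phone won't turn on"]
features = embed(sentences)        # one 512-dimensional vector per sentence

classifier = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(512,)),
    tf.keras.layers.Dense(3, activation="softmax"),  # positive/negative/neutral
])
classifier.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
print(classifier(features).numpy())  # untrained output, demonstrates shapes only
```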

PERFORMANCE RESULTS

Speech-to-text transcription

DeepSpeech, the automated speech recognition (ASR) engine
preferred by our client, is an end-to-end trainable, character level
LSTM recurrent neural network. DeepSpeech has been
used to produce a very low word error rate (WER), notably
a 6% WER on LibriSpeech, an open-source 1,000-hour
dataset of natural speech; 6% is approximately the human
error rate. Out of the box, DeepSpeech’s generic models returned
a 65.4% WER on our customer service call testing
data, more than 10x the WER of human processing.
This WER was ultimately reduced to approximately 18.7%
on the custom model generated using a very small corpus
of training data.

DeepSpeech functions best when transcribing sentence-length
audio clips of under 10 seconds. The processing time required by
DeepSpeech to create a transcript for each audio clip is between 4 and
9 seconds, with a mean of 7.14 seconds. Therefore, 5 minutes
of audio, split into roughly 30 sentence-length clips, could take
DeepSpeech almost 4 minutes of processing time.

Sentiment model

The accuracy of various models was compared on an 80/20
training/testing split of data made up of the IEMOCAP dataset and
a web-scraped dataset of electronics reviews, with the sentiment
classification accuracies contained in figure 6. These
were 3-class classifiers where the sentiment classes were
positive, negative, and neutral. Accuracies of different feature implementations were then compared, with results contained
in figure 7. Some feature implementations did not easily
lend themselves to being run on different classification
models, so that data is absent.

Figure 6: Accuracy results of different classifiers for sentiment.
Figure 7: Accuracy results of different feature implementations.
Figure 8: Confusion matrix for training data on the final LSTM model.
Figure 9: Confusion matrix for testing data on the final LSTM model.

Among the different datasets and implementations the best
results were on Neural Network Classifiers using the tf_hub
word embedding module. Of the word embedding modules,
the universal sentence encoder outperformed the other modules.
A deeper neural network outperformed TensorFlow’s DNNClassifier
under the same feature implementation, and was the
highest performer of all the implementations, at about 10%
higher test set accuracy than the second contender. The best
model iteration was an LSTM RNN that used tf_hub’s universal
sentence encoder as feature input. This model had a
test set accuracy of 77.6% on a combined IEMOCAP and
electronic review dataset. Baseline accuracy for randomly
choosing a class for 3 classes is 33.3%.

Topic Model

Using data scraped from a wireless carrier’s online discussion
forums and FAQ section, the training data was organized
into 5 topics: accounts and services, android products,
apple products, network and coverage, and other products.
Each of these topics also included some keywords that were
seeded into the LDA model with bias (such as “iphone” for
apple products, “galaxy” for android products, etc). Using
this training data to train the LDA model, it was able to successfully
predict the topics on the testing data 57.6% of the time. Baseline accuracy for randomly choosing a class for 5 classes is 20%. A visualization of this LDA model can be found in figure 10.

The part-of-speech subject identifier (PoSSI) model was trained
using corpora included with the Natural Language Toolkit
(NLTK) and then tested on the web-scraped wireless carrier
data described above. This model was able to produce
coherent subject-verb-object (SVO) results, but because the
web-scraped dataset is unlabeled, there is not a quantitative
way of analyzing the accuracy unless all web-scraped data
was hand-labeled for part-of-speech. Hand labeling the entire
web-scraped dataset was not in the scope of this project.

Visualization and analysis of customer call data

Visualizing sentiment over the course of a call showed trends
we might expect: a green (positive) first-sentence introduction,
followed by a red (negative) statement of the problem,
and darker greens (i.e., more strongly positive sentiment) towards
the end of the call.

These are promising results that suggest sentiment analysis
could be applicable in the context of automating customer
satisfaction surveys.

Figure 10: A visualization of the topics and the vocabulary found in the training corpora.
Figure 11: Top left: sentiment for each clip of the call displayed over the call transcript; Top right from top down: a heat-map of the intensity of the sentiment of each sentence of the call, a line graph displaying the intensity of the sentiment during each sentence of the call; Middle: the distribution of sentiment among keywords for each category; Bottom: the distribution of probabilities the call belongs into one of the five topic categories.
Figure 12: Sentiment distribution for sample service keywords.

CONCLUSION

Speech-to-text transcription

DeepSpeech is highly trainable but, with 120 million parameters,
requires massive training corpora, hundreds of hours
of training, and expensive equipment. Transfer learning was
performed in the form of freezing all sublayers of the model
and retraining the final layer. Experimentation with altering
earlier layers did not yield improvements. Transfer
learning is a viable course of action to customize Deep-
Speech for future transcription requirements but should be
trained using a greater number of high-quality audio clips.
DeepSpeech performed significantly better on our testing
data than alternative ASR engines such as Kaldi and CMUSphinx
as well as traditional ASR approaches such as Hidden
Markov Models.

Future work
Further improvements could likely be made in preparing
low-quality call data for transcription, as well as in post-processing
the transcriptions to produce more correct speech
(as DeepSpeech is character-level, it sometimes produces
nonsense). Retraining the weights of DeepSpeech on a large,
industry-specific corpus would also likely produce greater
accuracy, but would be significantly more expensive, both
in computation time and in the amount of data required.

Sentiment model

The two biggest impacts on the accuracy
of the tested sentiment analysis models were data preparation
and model selection. Data preparation and input formatting
have a larger impact on a smaller dataset. Model selection
has a larger impact on a larger data set. Throughout
the various model iterations we tried, three different types
of formatting for data input were used:

  • Manual Feature Selection (used in SVM and RandomForest-
    Classifier): Manual feature selection was done using stemming,
    filtering, and tokenization. Words were
    stemmed using NLTK’s word stemming module, so all
    words of different tenses would correspond to the same
    token. Words were excluded from the training/testing
    sets when their length was less than 3 or greater than 25 characters,
    or when they appeared more than 2,000 times or fewer than 5
    times in the full corpus. Tokens were then created for all
    viable unigrams and bigrams in the corpus.
  • Pre-trained Word Embedding Modules (used in DNNClassifier)
    - Pre-trained word embedding modules from tensorflow_hub
    were used for the DNN Classifier.
    - The NNLM embedding is based on a neural network
    language model with two hidden layers. The least frequent
    tokens are hashed into buckets.
    - Word2vec is a token-based embedding module trained
    on a corpus of English Wikipedia articles.
    - Universal-Sentence-Encoder encodes sentences for use
    in text classification. It was trained using a deep averaging
    network and has the feature of encoding semantic
    similarity.
  • Tokenization: Words with a length less than one were removed, then
    sentences were vectorized using Keras’s text preprocessing
    module. A token was chosen for every word in the
    training input. All sentences were padded to be of equal
    length, and words were replaced with tokens (see the sketch
    after this list).
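
A minimal sketch of that Keras tokenization and padding step is shown below; the example sentences and maximum length are placeholders.

```python
# Minimal sketch of the Keras tokenization and padding step from the last
# bullet; example sentences and maximum length are placeholders.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["my phone won't turn on", "thank you for your help"]

tokenizer = Tokenizer()                            # one integer token per word
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)    # words replaced with tokens
padded = pad_sequences(sequences, maxlen=10)       # pad to equal length
print(padded)
```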

Given the relatively small vocabulary size of our data set,
using a pre-trained word embedding module gave the best
results, outweighing the effect of which model was used. Alternative
models were also developed, such as:

  • SVM and Random Forest: These two models were the first
    to be run as a way to gauge what sort of accuracy was
    attainable, and what more complex models should be expected
    to beat. The sklearn SVM Classifier and Random-
    ForestClassifier were used for this task. These are two
    types of commonly used high performing classifiers.
  • DNN: This model is a prebuilt Deep Neural Network from
    Tensorflow. This model was used to evaluate the performance
    of a Neural Network on the dataset, and to test use
    of tf_hub’s word embedding modules.
  • LSTM: LSTMs are currently an industry standard model
    for sequential data like natural language.
  • CNN: Convolutional neural networks have been shown
    to provide excellent results on sentence classification by
    capturing information about n-grams through the use of
    convolutional layers.
  • CNN-LSTM: a combined CNN-LSTM model has been shown
    to work well by encoding the regional information with
    convolutional layers and long distance dependencies across
    sentences using LSTM layers.

With a small-to-mid sized training dataset, there is a much higher risk of overfitting to the training set when using neural networks with thousands of parameters, such as RNNs and CNNs. This can be seen in the large discrepancies between testing and training set accuracies in our performance results. The ideal point at which to stop training is right as the training and testing scores begin to diverge: the training score continues going up, but the testing score starts going down. In the context of training a model, this means it is important to save checkpoints as you train so that the model can be reverted to an older checkpoint which is not yet overfitting. See Appendix B for an example of training and testing scores over the epochs of training. Overfitting can also be reduced by adding a regularization parameter to the loss function or by introducing dropout layers. Adding dropout to layers has seen success in most types of neural networks. However, this method does not work well on RNNs with LSTM units, which is the type of model that showed the greatest performance in this project. To mitigate overfitting on our small training dataset, we applied dropout only to a non-recurrent layer.
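
One way to realize that checkpointing advice in Keras is sketched below: a callback keeps the weights from the best validation epoch and stops training once validation accuracy stalls. The model, data, monitored metric, and file name are placeholders, not the project's actual training code.

```python
# One way to realize the checkpointing advice above in Keras: save the weights
# from the best validation epoch and stop once validation accuracy stalls.
# The model, data, monitored metric, and file name are placeholders.
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    # Save weights only when validation accuracy improves, so training can be
    # reverted to the last checkpoint taken before overfitting set in.
    ModelCheckpoint("best_weights.h5", monitor="val_accuracy",
                    save_best_only=True, save_weights_only=True),
    # Stop once validation accuracy has not improved for 3 epochs.
    EarlyStopping(monitor="val_accuracy", patience=3,
                  restore_best_weights=True),
]

# model.fit(x_train, y_train, validation_data=(x_test, y_test),
#           epochs=50, callbacks=callbacks)
```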

Future work

  • Manual Feature Selection: When doing manual feature
    selection, the NLTK stemming module was used to find
    word roots. Lemmatization would likely give better results,
    but requires part-of-speech tagging. Implementing
    this would create a more accurate feature space. This could
    be done using NLTK’s WordNetLemmatizer and NLTK’s
    pos_tag in conjunction (a sketch follows this list).
  • Training Data: Use a larger training dataset. The IEMOCAP
    data set has a vocabulary size of 3,000 words, which
    is relatively small, and testing data could easily contain
    unseen words. This was improved by the inclusion of the
    electronic review dataset which increased the vocabulary
size to around 8,000 words. Furthermore, the data is not
as domain-specific as it could be. IEMOCAP is very general
purpose, performed by actors reading scripts about
everyday situations. The electronic reviews are for electronic
    products similar to those talked about in a technical
    troubleshooting wireless carrier customer call, but are not
    recent. More data specific to the content discussed in wireless
    carrier customer service calls could be introduced,
    such as more recent electronic device reviews or manually
    annotated wireless carrier discussion forum posts.
  • Pre-Trained Word Embedding Modules: Transfer learning
    could be implemented on these modules using a corpus
    of text relevant to the type of text data on which classification
    is being attempted. In this case the word embedding
    modules could be further trained on a corpus of
    wireless carrier related text, such as data scraped from a
    specific wireless carrier’s online FAQ section and discussion
    forum posts.
  • CNN and LSTM Neural Networks: The neural network models
    could be improved by further tuning the layers and their
    parameters and by feeding the models more data.
    These efforts would serve to increase accuracy
    and decrease overfitting.
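
The sketch below illustrates the lemmatization idea from the Manual Feature Selection item, combining NLTK's pos_tag with WordNetLemmatizer. The Treebank-to-WordNet tag mapping and the example sentence are assumptions for illustration, not project code.

```python
# Sketch of lemmatization with part-of-speech tagging using NLTK's pos_tag and
# WordNetLemmatizer, as suggested above. The Treebank-to-WordNet tag mapping
# and the example sentence are illustrative, not project code.
import nltk
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

for pkg in ("punkt", "averaged_perceptron_tagger", "wordnet"):
    nltk.download(pkg, quiet=True)

LEMMATIZER = WordNetLemmatizer()

def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to the POS constant the lemmatizer expects."""
    return {"J": wordnet.ADJ, "V": wordnet.VERB,
            "N": wordnet.NOUN, "R": wordnet.ADV}.get(treebank_tag[0], wordnet.NOUN)

def lemmatize(sentence):
    tagged = pos_tag(word_tokenize(sentence))
    return [LEMMATIZER.lemmatize(word, to_wordnet_pos(tag)) for word, tag in tagged]

print(lemmatize("The phones were dropping calls constantly"))
# e.g. ['The', 'phone', 'be', 'drop', 'call', 'constantly']
```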

Topic Model

The low accuracy result of the LDA model is expected as
LDA models are not meant to be forced into predefined topics
and instead should be used for finding hidden topic information
in the corpus. SVO extraction using POS and NER
tagging was able to produce coherent results, but these results
are not consistent across topics, which presents a challenge
when trying to sort documents by topic.

Future work

  • The LDA model could be improved by using a more domain-specific
    part-of-speech tagger (rather than the default POS
    tagging function from NLTK) as well as better refined alpha
    (document-topic density) and beta (topic-word density) parameters.
  • The PoSSI model could be improved in a few ways:
    - Using a POS tagger and a Named Entity Recognition
    (NER) tagger trained on domain-specific corpora
    - A more domain-specific model for the POS tagger (currently
    uses a research model from Stanford)
    - A more domain-specific model for the NER tagger (currently
    uses a research model from Stanford)
    - A better method of extracting the subject from a tagged
    document such as a parser in combination with a context-free
    grammar for the domain-specific corpora
    - Speed improvements as the model can be very slow for
    large documents.
  • Both of these models could be combined to identify topics
    in the generated SVOs.

Original PDF of our Final Report
