Why Zceppa Uses the GPT-4 LLM for Review Analysis?


With the advancements, proliferation, and success of LLMs, brands today can use AI to understand the human behavior, feelings, and emotions behind written, spoken, and even body language. Language models have pushed the ability to surface nuance and insight to a hitherto unheard-of level.

Multi-modal AI today is developing in ways that the average human mind cannot even imagine – we get closer to the point where “if you can imagine it, you can do it”! As I type, I am being proofread; as I drive, I get re-routed, and my expressions, voice, and tone can all be deciphered using AI. 

Understanding customer feedback and public reviews using AI is a very effective use case for brands. Multi-lingual and multi-modal AI capability is pushing the boundaries of what can be achieved – rather than expend resources transcribing data from one format to the next and then using code to decode it, why not let AI do this for you? 

Use Cases for AI in Review Analysis

Over the last 24 months, Large Language Models have transformed how brands work with user-generated content and data. Zceppa had a clear roadmap for AI infusion into the platform.

  • Sentiment Analysis: Understanding customer emotions and perceptions at scale.
  • AI Replies to Reviews: Enhancing responses to reviews through human-like interactions.
  • Fake Review Detection: Identifying and mitigating the impact of fraudulent reviews.
  • Topic and Theme Extraction: Categorizing reviews by key areas such as pricing, service, or product quality.
  • Trend Identification: Spotting emerging issues or opportunities from feedback.
  • Enhanced Customer Interaction: Creating personalized, data-driven responses to reviews.

Several of our customers had been talking about “Sentiment Analysis” – the timing was just right for Zceppa. 

Our first customer was keen to pick apart the negative reviews customers were leaving them. They did not have the bench strength or the domain nuance to do this manually. Going through multitudes of feedback and classifying it would have required expert analysis even four years ago; today, it is just another area to gain efficiency.

Rather than decode this manually or just handle it one at a time, they could dive deeper and understand all the keywords customers used to describe the brand. What made this particularly compelling was that they wanted to track keywords mapped to critical areas of their brand experience and see customers’ negative perceptions.

There were really good opportunities within this problem set to identify areas for the brand to improve and take action. Zceppa’s AI Use Cases fell right in line with what this presented, and we got to work.
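As an illustration of the keyword tracking described above, here is a minimal sketch (not Zceppa's implementation; the theme map and keyword sets are made up) that tallies keyword mentions per brand-experience area:

```python
from collections import Counter, defaultdict

# Hypothetical mapping of review keywords to brand-experience areas.
THEME_KEYWORDS = {
    "service": {"staff", "support", "rude", "helpful"},
    "pricing": {"expensive", "cheap", "overpriced", "value"},
    "facilities": {"parking", "lift", "clean", "crowded"},
}

def tally_keywords(reviews):
    """Count keyword mentions per theme across a batch of reviews."""
    counts = defaultdict(Counter)
    for review in reviews:
        words = {w.strip(".,!?").lower() for w in review.split()}
        for theme, keywords in THEME_KEYWORDS.items():
            for kw in keywords & words:
                counts[theme][kw] += 1
    return counts

reviews = [
    "Staff were rude and parking was a nightmare.",
    "Overpriced for the value, and the lift was broken.",
]
counts = tally_keywords(reviews)
```

In practice the keyword extraction itself would come from the LLM; this sketch only shows the aggregation step that maps extracted terms back to areas of the brand experience.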

LLM Tools Explored

One of the first things we considered as a team was to look at APIs that already provided this functionality, which we could easily integrate into our platform. 

We went through several tools – including gCloud sentiment analysis, Text2Data, NLTK, and spaCy – looking for one that would assign a clear sentiment and score to each piece of content that could be mapped to customer sentiment. The tool also had to identify patterns in the keywords that emerged.

One of the primary reasons we started looking at native generative AI for this use case was the various limitations of the other API-based tools. 

Review and feedback data is unstructured; there were no limits on the quantum of content any user/reviewer could write, and a single review or piece of content could have multiple keywords with nuanced sentiments. Multiple languages also needed to be considered. 

For the project to be successful, the tool/model needed to deal with all of these varied nuances in a human-like fashion.

We first set up a high-quality data source to use as a control set across the different tools to compare our results.
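With a labeled control set in place, tool comparison becomes mechanical: run each candidate classifier over the same reviews and score it. A minimal sketch of such a harness, with a stub classifier standing in for a real API call (the stub's rules are made up for illustration):

```python
def accuracy(classify, labeled_reviews):
    """Score a sentiment classifier against a labeled control set.

    `classify` is any callable mapping review text to a label
    ("positive" / "negative" / "neutral"); `labeled_reviews` is a
    list of (text, expected_label) pairs.
    """
    if not labeled_reviews:
        return 0.0
    hits = sum(1 for text, expected in labeled_reviews
               if classify(text) == expected)
    return hits / len(labeled_reviews)

# Stub classifier standing in for a real API call, for illustration only.
def naive_classify(text):
    lowered = text.lower()
    if "great" in lowered or "best" in lowered:
        return "positive"
    if "bad" in lowered or "worst" in lowered:
        return "negative"
    return "neutral"

control_set = [
    ("The beaches were very hot.", "neutral"),
    ("Worst waiting times I have seen.", "negative"),
    ("Great staff, great care.", "positive"),
]
score = accuracy(naive_classify, control_set)
```

The same `accuracy` function can wrap each tool under test, so every candidate is scored on identical data.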

Good Read: How Zceppa Uses GPT-4 AI Models To Analyse Customer Reviews at Scale

Key Criteria Used for the Test Data Set-Up

  1. Source of Data
  • The datasets were collected from publicly available reviews. Examples include:
    • TripAdvisor Reviews
    • Reviews Related to Healthcare Services
  2. Type of Data
  • Textual Data: User-generated feedback in the form of multilingual reviews.
  • Sentiment-Oriented: Primarily analyzed user sentiments (positive, negative, neutral).
  3. Size of Data
  • The dataset size ranged from hundreds to thousands of reviews, depending on the source and scope of analysis.
  4. Diversity
  • The data covered multiple domains, such as travel, services, and healthcare, and was multi-lingual, providing a comprehensive perspective.
  5. Quality of Data
  • Since publicly available reviews were used, relevance was high. 
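The criteria above can be captured as a simple record per review; a sketch with illustrative field names (not Zceppa's actual schema):

```python
from dataclasses import dataclass

@dataclass
class ControlReview:
    """One labeled record in the control set; field names are illustrative."""
    text: str
    source: str              # e.g. "tripadvisor", "healthcare"
    language: str            # ISO code, e.g. "en", "hi"
    domain: str              # e.g. "travel", "services", "healthcare"
    expected_sentiment: str  # "positive" | "negative" | "neutral"

sample = ControlReview(
    text="The beaches were very hot.",
    source="tripadvisor",
    language="en",
    domain="travel",
    expected_sentiment="neutral",
)
```

Keeping source, language, and domain on each record makes it easy to break accuracy down per dimension when comparing tools.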

We started testing out the key LLMs available, including:

  • GPT-3 from OpenAI
  • GPT-4o
  • Gemini from Google
  • Amazon Comprehend
  • Custom AI Models
    • Hugging Face (bert-large-cased-finetuned-conll03-english and xlm-roberta-large-finetuned-conll03-english models)
    • spaCy (en_core_web_lg, en_core_web_md, en_core_web_sm models)

Here’s a brief insight into each tool’s unique capabilities and limitations in handling large-scale review data.

1. GPT-3

We started with the GPT-3 tool to analyze customer feedback/reviews. We were particularly interested in how the language model perceived the tone of the content, and there were some observations.

  • Excellent in handling structured and unstructured text data.
  • Proficient in sentiment analysis and nuanced text interpretation.
  • Supports multiple languages, making it versatile for diverse datasets.

Limitations:

  • Required fine-tuning for domain-specific analysis (e.g., healthcare reviews).
  • Context window limitation (up to 4,096 tokens), which may hinder the processing of long reviews.
  • Cost prohibitive for large-scale data processing.

Token Pricing:

  • Input: $0.006 per 1,000 tokens
  • Output: $0.012 per 1,000 tokens
  • Significantly more expensive compared to GPT-4o for large-scale processing.

Use Case:

  • Suitable for moderate workloads and scenarios where GPT-4 features are not necessary.
  • Legacy model with wide adoption in earlier AI applications.

Performance:

  • Good quality outputs for general text processing.
  • Lacks the efficiency and optimizations of newer models like GPT-4o.
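For prompt-based models like GPT-3, each review is framed as a completion request. A hedged sketch of how such a request payload might be assembled (the prompt wording and model name are assumptions, not Zceppa's production prompt); no network call is made here:

```python
def build_sentiment_request(review_text, model="gpt-3.5-turbo"):
    """Construct a chat-completion payload asking for a sentiment label.

    The prompt wording and model name are illustrative assumptions,
    not Zceppa's production configuration.
    """
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": ("Classify the sentiment of the customer review as "
                         "positive, negative, or neutral, and list the key "
                         "keywords driving that sentiment.")},
            {"role": "user", "content": review_text},
        ],
        "temperature": 0,  # deterministic labels for repeatable analysis
    }

payload = build_sentiment_request("Staff were friendly but the wait was long.")
```

Setting temperature to 0 matters for this use case: repeated runs over the same review should yield the same label, which is the consistency property evaluated later in the results.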

2. GPT-4o 

GPT-4o provided clear advantages over the earlier model.

  • Enhanced understanding of context, especially in complex or ambiguous text.
  • Handles up to 32,768 tokens in its context window (an improvement over GPT-3).
  • Stronger reasoning abilities and more accurate sentiment detection in domain-specific cases

Limitations:

  • Higher computational requirements and costs.
  • Fine-tuning is not natively supported by OpenAI (reliance on embeddings or external techniques).

Token Pricing:

  • Input: $0.00075 per 1,000 tokens
  • Output: $0.001 per 1,000 tokens
  • Most cost-effective option, especially for large-scale processing or batch jobs.

Performance Optimization:

  • Designed to deliver comparable output quality to GPT-4 with reduced computational costs.
  • Ideal for projects requiring scalability and high throughput.

Use Case:

  • Best suited for handling massive datasets (e.g., 100k+ reviews).
  • Perfect for organizations seeking quality and affordability without sacrificing performance.

Comparison:   

  • Input tokens: GPT-3 costs $0.00525 more per 1,000 input tokens
  • Output tokens: GPT-3 costs $0.011 more per 1,000 output tokens

GPT Pricing 

  1. GPT-3 Cost (per million tokens):
  • Input: $6
  • Output: $12
  • Total: $18
  2. GPT-4o Cost (per million tokens):
  • Input: $0.75
  • Output: $1
  • Total: $1.75
  3. Additional Cost of GPT-3:
  • $16.25 more per million tokens processed compared to GPT-4o.
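The pricing gap above is simple arithmetic over the per-1,000-token rates quoted earlier:

```python
def workload_cost(input_rate_per_1k, output_rate_per_1k,
                  input_tokens=1_000_000, output_tokens=1_000_000):
    """Total cost for a workload, given per-1,000-token rates."""
    return ((input_tokens / 1000) * input_rate_per_1k
            + (output_tokens / 1000) * output_rate_per_1k)

gpt3_cost = workload_cost(0.006, 0.012)       # $18.00 per million in + out
gpt4o_cost = workload_cost(0.00075, 0.001)    # $1.75 per million in + out
savings = gpt3_cost - gpt4o_cost              # $16.25 per million tokens
```

At review-analysis scale (hundreds of thousands of reviews), this per-million-token difference dominates the tool choice.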

Good Read: How to Get Your Google Business Verification at Scale

3. Google Gemini

Gemini’s ability to handle text and other data types (e.g., images) makes it versatile for analyzing complex datasets. This is particularly useful for extracting insights from reviews that may include multimedia content.

  • Compared to other LLMs like GPT-4, Gemini offers a cost-effective solution for large-scale data processing due to its integration with Google’s cloud infrastructure and optimized token handling.
  • Seamlessly integrates with Google Cloud tools for analytics, enabling scalable deployment and easy pipeline building for large datasets.

Limitations:

  • Gemini struggles with processing multiple languages, often leading to misinterpretation of context in reviews that involve non-English content.
  • The model exhibits limitations in detecting sarcasm, often resulting in false positives for sentiment analysis, especially in nuanced datasets like healthcare reviews – the higher percentage of false positives meant the actual nuance would be missed.

4. Amazon Comprehend​

Amazon Comprehend uses machine learning to analyze text and derive insights such as language detection, key phrase extraction, named entity recognition, and sentiment analysis.

Sentiment Analysis in Amazon Comprehend helps detect the overall sentiment of a document or a text snippet. It classifies sentiment into the following categories:

  • Positive
  • Negative
  • Neutral
  • Mixed

Key Features:

  1. Scalability: Automatically scales to handle varying workloads.
  2. Real-time or Batch Processing: Supports synchronous and asynchronous modes for processing text.
  3. Integration: Works seamlessly with other AWS services such as S3, Lambda, and Redshift.
  4. Multilingual Support: Can detect sentiments in multiple languages.
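Amazon Comprehend's DetectSentiment response carries both a top-level label and a per-category score breakdown. A small sketch of picking the dominant category from a Comprehend-shaped response (the sample scores below are made up; in practice the API already returns the label in the `Sentiment` field):

```python
def dominant_sentiment(sentiment_score):
    """Pick the top category from a Comprehend-style SentimentScore dict.

    Mirrors the shape of Amazon Comprehend's DetectSentiment response,
    i.e. {"Positive": ..., "Negative": ..., "Neutral": ..., "Mixed": ...}.
    """
    return max(sentiment_score, key=sentiment_score.get).upper()

# Made-up sample shaped like a DetectSentiment response.
sample_response = {
    "Sentiment": "MIXED",
    "SentimentScore": {"Positive": 0.41, "Negative": 0.38,
                       "Neutral": 0.06, "Mixed": 0.15},
}
label = dominant_sentiment(sample_response["SentimentScore"])
```

Inspecting the score breakdown (rather than only the top label) is what surfaces near-ties like the sample above, where "Positive" barely edges out "Negative" – exactly the mixed-tone cases where Comprehend struggled.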

Limitations

  1. Accuracy in Complex Sentences:
    • Struggles with sarcasm, irony, or nuanced expressions.
    • May misinterpret sentiments in texts with mixed tones.
  2. Limited Language Support:
    • While it supports multiple languages, its performance varies across languages, especially for less widely spoken ones.
  3. No Customization for Domain-Specific Use Cases:
    • Cannot be fine-tuned for specific industries or specialized vocabularies (e.g., healthcare, legal, or technical domains).
  4. Cost Implications:
    • Costs can escalate for high-volume or large-scale sentiment analysis, making it less ideal for smaller businesses with limited budgets.

5. Custom-Trained AI Models

Capabilities:

  • Specifically tailored to the dataset (e.g., healthcare reviews), offering a deep understanding of domain-specific jargon and sentiment.
  • Enables training with metadata and custom features like star ratings, improving sentiment accuracy.
  • Cost-effective for long-term, repeated analysis once trained.

Limitations:

  • High upfront resource investment in terms of data preparation, training, and infrastructure.
  • Performance is dependent on the quality, diversity, and size of the training dataset.
  • Requires expertise to train, fine-tune, and deploy effectively.

Analysis of the Results 

OpenAI (GPT-4o)

1. Accurately handled multilingual content, sarcasm, and nuanced contexts.
2. Provided consistent outputs across repeated inputs.

Example: The Hindi sentence “समुद्र तट बहुत गर्म थे।” (“The beaches were very hot.”) was correctly identified as neutral.

gCloud NLP
1. Struggled with sarcasm and complex sentences.
2. Tended to overemphasize positive sentiment, leading to false positives.

Example: “Hospital services are best Parking problems small lift” was generalized as positive.

GeminiAI (1.5 Flash)
1. Inconsistent responses and missed key nuances in mixed sentiments, sarcastic remarks, and longer sentences.

Example: “It’s good, not bad, value for money” was generalized as positive.

Amazon Comprehend

1. Pre-trained models provided decent accuracy and supported multiple languages.

2. Cost was considerably higher than OpenAI and Google, and the model struggled with sarcasm and mixed sentiments.

Hugging Face ML

1. Open source, with multiple pre-trained models for basic sentiment analysis, but accuracy with the pre-trained models was not the best.

2. Required a high level of ML expertise for fine-tuning models, along with comparably greater hardware allocation for the project.

Summary of Observations for Each LLM

| LLM | Pros | Cons |
| --- | --- | --- |
| GPT-4 | Superior contextual understanding, multi-language support, scalable | Higher computational cost |
| GPT-3 | Cost-effective, handles basic sentiment analysis well | Limited contextual depth, lower accuracy |
| Google Gemini | Strong search integration, lightweight AI | Limited availability, evolving ecosystem |
| Custom Models | Tailored to specific needs, complete control | Time-intensive development, resource-heavy |
| Amazon Comprehend | Native AWS integration, cost-effective for large-scale use, supports multiple languages | Limited contextual depth, weaker in handling sarcasm and nuanced sentiment |

Conclusion

Teams managing branding and customer experience have long relied on tools and technology to conduct surveys and listen to what customers say about their brands. 

With the advancements in technology and the continuous lowering of the cost of computing, using Artificial Intelligence powered by LLMs is becoming mainstream. While business appetite is still in its early days, there is sufficient interest and budget for AI, especially in healthcare.

Utilizing Generative AI and Language Models to understand millions of data points of customer feedback is a huge leap forward. Besides operating 24x7 and crunching millions of data points in seconds, AI can potentially remove some human bias, although most AI skeptics point to hidden and large biases within AI itself.

Within the context of our testing, OpenAI outperformed the others in multilingual sentiment analysis, nuanced understanding, and sarcasm detection, while delivering consistent results for repeated inputs. 

With the GPT-4o model, Zceppa could build a comprehensive solution for customer sentiment – from POC to V1 within 60 days! 

Explore how Zceppa’s GPT-4 integration can transform your review management strategy. Try a demo today!

Frequently Asked Questions

Why not use GPT-3 instead of GPT-4?

1. Contextual Understanding

  • GPT-4 offers superior reasoning and nuanced sentiment detection, especially in complex or domain-specific data, compared to GPT-3.
  • It can handle ambiguity, subtle emotional cues, and sarcasm more effectively.

2. Token Limit

  • GPT-3 is limited to 4,096 tokens, restricting its ability to process lengthy reviews or large datasets at once.
  • GPT-4 extends this to 32,768 tokens, making it ideal for handling verbose reviews or batch processing.

3. Accuracy and Reliability

  • GPT-4 demonstrates a higher accuracy rate in multilingual and domain-specific tasks, reducing false positives/negatives in sentiment analysis.

How does GPT-4 handle multilingual reviews?

What makes GPT-4 scalable for enterprise-grade applications?

What are the data privacy measures in place when using LLMs?

How does Zceppa’s integration ensure real-time analysis and actionable insights?


Signup for a free trial

Zceppa’s products empower your business to win every mobile-first consumer interaction across the buying journey.