Transformational Efficiencies: Tickr AI vs. CPG Human Experts
Large language models (LLMs) are revolutionizing and automating data analysis and other knowledge-intensive tasks, and the Consumer Packaged Goods (CPG) industry is ripe for large transformational gains from these models. These technologies provide new tools and methodologies to drive sales and business development and to capitalize on opportunities in non-traditional channels. A prime example of this revolution is the ChatCPG platform, a cutting-edge system designed to deliver CPG-specific task automation using LLMs integrated with state-of-the-art data science models. This white paper provides an in-depth look at the performance and capabilities of Tickr's LLM platform, based on a blinded internal experiment conducted by Tickr Research in July 2023.
The task we explored was answering complex, multi-step, knowledge-intensive analytic queries over a set of white papers written by Numerator and Kantar. The breadth of these documents ranges from a mobility survey to a study on growth strategies. The type of system we are discussing is a retrieval-augmented generation (RAG) system. RAG systems enhance the capabilities of LLMs by allowing non-parameterized knowledge (knowledge not encapsulated in the LLM's weights) to be fetched when the model determines it needs additional information to answer a query or solve a task. This non-parameterized knowledge could be fetched from the internet, a database, a company's documentation, a vector store, and much more. For this article, the non-parameterized knowledge being fetched is the set of white papers written by Numerator and Kantar.
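To make the fetch step of a RAG system concrete, here is a minimal, self-contained sketch in Python. It uses a toy bag-of-words similarity in place of a real embedding model, and the chunks and query are illustrative, not taken from the actual documents:

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use a neural embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Rank every chunk by similarity to the query and keep the top k.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "The MovinOn survey covers mobility habits of young urban dwellers.",
    "Numerator's Growth In Sight paper sizes a $101 billion opportunity.",
    "Quarterly sales in traditional Food, Mass, and Drug channels.",
]
context = retrieve("growth opportunity in non-traditional channels", chunks)
# The retrieved chunks are then placed into the LLM's prompt as context.
prompt = "Answer using only this context:\n" + "\n".join(context)
```

In a production system the corpus, the embedding model, and the prompt template are all far richer, but the retrieve-then-condition shape is the same.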
ChatCPG's performance is enhanced by its adaptive query engines, chain-of-thought reasoning, ReAct agents, hybrid retrieval, and much more, all enabling it to select the most appropriate query engine, and even RAG system, for the task at hand. Our system has high-precision, high-recall search capabilities, achieved through techniques like hybrid search and LLM rerankers. These features allow ChatCPG to deliver accurate and precise results, even when dealing with complex questions or intricate promotional strategies. Looking to the future, we are planning to integrate and research chain-of-thought reranking, multi-modal retrieval incorporating vision and time series, and automatic tuning of RAG architectures to a client's data.
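Hybrid retrieval of the kind mentioned above is commonly implemented by fusing a keyword ranking with a vector ranking; one standard recipe is reciprocal rank fusion (RRF). The sketch below is a generic illustration of RRF rather than ChatCPG's actual implementation, and the document IDs are hypothetical:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Standard RRF: score(d) = sum over each ranking of 1 / (k + rank(d)).
    # Documents that place well in several rankings rise to the top.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc_b", "doc_a", "doc_d"]  # e.g. a BM25 keyword ordering
vector_hits = ["doc_a", "doc_c", "doc_b"]   # e.g. an embedding-search ordering
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Here `doc_a` wins because it ranks highly in both lists, which is exactly the behavior hybrid search exploits.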
We are also expanding ChatCPG's scope to include more file and content types for analysis, such as consumer surveys, white papers, sales reports, and company dashboards. This white paper describes a blinded experiment comparing ChatCPG to both Azure OpenAI and a human CPG expert, in which ChatCPG outperformed both on complex multi-step tasks. Our ChatCPG platform is built on the vibrant open-source LLM community, using tools like LlamaIndex, LangChain, and open-source LLMs like Llama 2 and Zephyr, as well as Weights & Biases and OpenAI.
The methodology of our experiment involved a multi-step process to evaluate the diverse capabilities of the ChatCPG platform. Seven tasks were chosen to ensure a comprehensive analysis of the platform's capacity to process and analyze unstructured, heterogeneous data. We also chose tasks that required reasoning across multiple documents in order to synthesize salient and correct responses.
Following the selection of data, we indexed all the information contained in these documents. This step was crucial, as it formatted the data in a manner conducive to efficient retrieval and analysis by the ChatCPG platform. Each document was exposed as a tool (action) that our ReAct agent could select. Each tool exposes two types of query engines: one designed for summarization tasks and another designed for semantic search tasks. These query engines will be referred to as sub-query engines. Using multiple sub-query engines does increase latency, but it provides far more flexibility and accuracy when answering queries. The latency incurred is due to our ReAct agent performing multiple steps of thoughts, actions, and observations (see Yao et al., 2023 for more details) to synthesize complex responses.
When ChatCPG is queried, the ReAct agent selects a query engine tool based on the query. For example, if the question "How can we tailor our marketing campaigns to address the changing mobility habits and preferences of young urban dwellers in Europe and North America, as revealed in the MovinOn Mobility Survey?" is posed to our system, ChatCPG will choose the appropriate query engine tool (index) to answer it. If semantic search is required, the ReAct agent selects the index suited to semantic search; if it is a summarization task, the agent selects the index suited to summarization.
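The routing decision can be pictured with a much-simplified stand-in. In ChatCPG the selection is made by the ReAct agent's LLM reasoning over tool summaries; the keyword heuristic below merely illustrates the interface, and the cue words and index names are assumptions for illustration:

```python
def route_query(query: str) -> str:
    # Simplified heuristic stand-in for the LLM-driven ReAct selection step:
    # summarization-style questions go to the summary sub-index, and
    # everything else goes to the semantic-search sub-index.
    summary_cues = ("summarize", "summary", "overview", "key takeaways")
    if any(cue in query.lower() for cue in summary_cues):
        return "summary_index"
    return "semantic_index"
```

An LLM router generalizes far beyond fixed keywords, but its contract is the same: a query comes in, and one named sub-query engine comes out.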
In addition to indexing the documents into ChatCPG, we did the same with Azure OpenAI. Data was indexed using their Azure Cognitive Search offering and queried via their API. Not all of the details of their indexing strategies are exposed to us, but we presume they use vector search as their backbone. We ran the same queries through our engines as through Azure OpenAI, then recorded and timed the responses via their API. Azure OpenAI was included because it is the most established enterprise offering for RAG as a service.
A human CPG expert was also given access to the same documents and asked to answer the same questions. The expert has 30+ years of experience working for and with the largest CPG companies in the world and is familiar with the tasks in the experiment. They were asked to answer the questions and record the time it took them to do so.
The ChatCPG platform, Azure OpenAI, and the human CPG expert were graded on four metrics: clarity, detail, completeness, and accuracy. Most metrics measure both the retrieval and generation performance of the system, as these are decoupled processes in RAG systems. The metrics are rated on a Likert scale ranging from 1 to 10, where 1 represents the worst and 10 the best:
- Clarity: This is a generation metric because it measures how effectively the generated text communicates its point. A system must first understand the input (retrieval) to provide a clear response, but the metric focuses on how the response is articulated (generation). Clarity is essential for ensuring that the generated text is not only grammatically correct and logically structured but also easily understandable, without requiring specialized knowledge to interpret.
- Detail: Detail evaluates both the richness of information provided (generation) and the system’s ability to pull relevant information from a dataset or knowledge base (retrieval). A detailed response includes necessary facts, figures, examples, or explanations that enrich the answer, demonstrating the system’s capability in both retrieving comprehensive data and generating content that effectively utilizes this data.
- Completeness: This metric assesses both the retrieval and generation performance by looking at whether the system’s response fully addresses the prompt. In the retrieval phase, the system must access all relevant information related to the prompt. During generation, the system must then utilize this information to construct a response that leaves no critical aspects unexplored or questions unanswered. Completeness ensures the response covers all significant angles of the topic.
- Accuracy: Accuracy also measures both retrieval and generation as it requires the system to retrieve factual and truthful information and then generate a response that accurately reflects this information. The system must avoid errors, distortions, or “hallucinations” (generating plausible but incorrect or unverified information) and instead align with established knowledge. This ensures that the generated content is based on verified data and logical reasoning.
The responses from the ChatCPG platform, Azure OpenAI, and the human CPG expert were graded by another CPG-industry expert with 25+ years of experience. Responses to tasks were shown to the evaluator in random order, and they were asked to grade each triplet of responses. To ensure impartiality, the origin of each response was anonymized, so the evaluator could not tell whether it was authored by the ChatCPG platform, Azure OpenAI, or the human CPG expert. The evaluator was instructed to thoroughly review each document in order to appropriately evaluate the responses.
For instance, among the documents used in our study was the "Numerator Growth In Sight White Paper." We posed the question, "What specific growth strategies can we implement to capitalize on the $101 billion opportunity in non-traditional channels, as outlined in the Numerator Growth In Sight Whitepaper?" For a human, addressing this query requires a nuanced understanding and a methodical approach, which might include the following steps:
- Read the paper to understand the context and details of the $101 billion opportunity.
- Identify key factors that contribute to growth in non-traditional channels.
- Determine potential areas of expansion.
- Develop recommendations.
- Articulate how strategy can be operationalized to tap into the identified opportunities.
On the other hand, for AI and RAG systems answering the query might require the following steps:
- Index paper into vector index and summary index.
- Combine indexes into a composable query engine.
- LLM creates tool summary of composable query engines for paper.
- Expose composable query engine as tool (action) ReAct agent can select.
- ReAct Agent selects appropriate tool based on query and tool summary.
- ReAct Agent selects which sub-index (summarization vs. semantic) to search.
- ReAct Agent transforms query to be most conducive for index & paper.
- High recall search is performed (k=10).
- LLM performs a relevance reranking for higher precision (k=3).
- Response synthesized conditioned on retrieved context to answer original query.
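The high-recall-then-rerank steps above (k=10 narrowed to k=3) can be sketched as a two-stage function. The word-overlap scorers below are toy stand-ins for the embedding search and LLM reranker used in ChatCPG, and the corpus is illustrative:

```python
def overlap(query: str, doc: str) -> float:
    # Toy first-stage scorer standing in for vector similarity.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def retrieve_then_rerank(query: str, corpus: list[str],
                         k_recall: int = 10, k_precision: int = 3) -> list[str]:
    # Stage 1: cheap, high-recall search keeps a wide candidate pool.
    candidates = sorted(corpus, key=lambda d: overlap(query, d), reverse=True)[:k_recall]
    # Stage 2: a more expensive scorer (an LLM reranker in ChatCPG) re-orders
    # the pool for precision; here we break ties by preferring shorter chunks.
    reranked = sorted(candidates, key=lambda d: (overlap(query, d), -len(d)), reverse=True)
    return reranked[:k_precision]

corpus = [f"filler chunk number {i} about unrelated topics" for i in range(9)]
corpus += ["growth in non-traditional channels",
           "growth strategies for traditional channels",
           "a long chunk about growth in non-traditional retail channels overall"]
top = retrieve_then_rerank("growth in non-traditional channels", corpus)
```

Only the three reranked chunks reach the synthesis prompt, which is what keeps the final generation focused despite the wide first-stage net.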
The steps outlined above serve as an illustration; generative models like LLMs, and the abstractions built on top of them, such as ReAct agents and rerankers, can vary the process of synthesizing responses from one query to another and from one document to another. Additionally, if the ReAct agent determines that it needs to retrieve data from multiple indexes for response synthesis, the process can involve multiple ReAct agent loops.
In the context of the Numerator Growth in Sight White Paper, Response 1 is the CPG Human Expert, Response 2 is the ChatCPG platform, and Response 3 is Azure OpenAI.
Response 1 (CPG Human Expert): These are growth strategies to implement in non-traditional channels where 84% of CPG growth occurs:
Understand what’s happening beyond the traditional channels of Food, Mass, and Drug by leveraging Numerator’s Single-Source Consumer Panel and TruView share reports.
Use the above for analyzing a wider range of shopper segments and behaviors to plan for innovation and assortment. This includes new product packaging, targeted brand messaging, online promotional adjustments, and appealing to broader shopper demographics.
Each response answers the question in different ways, leading to differences in accuracy, detail, completeness, and clarity. More specifically, Response 1 offers a qualitative approach to understanding consumer behavior with the help of specific analytical tools, but lacks breadth and detail.
Response 2 (ChatCPG): Given the new context provided from the Numerator Growth in Sight Whitepaper, here are refined growth strategies:
Focus on High-Growth Departments: Prioritize departments that show high sales growth in non-FMD channels, such as Breakfast, Sexual Wellness Products, Batteries, Baby Safety, Paper & Plastic, Pregnancy & Maternity, and Alcohol Beverages. Tailoring your product offerings to these high-growth areas can tap into the observed growth.
Adjust Product Offerings: Given the diverse range of departments showing growth in non-FMD channels, it may be beneficial to diversify your product offerings. This could mean expanding into new categories or tailoring existing products to fit these departments.
E-commerce and Online Strategy: Given the high percentage of sales in non-FMD channels, an effective online strategy is crucial. Strategies should include visibility on digital shelves, understanding online consumer behavior, and adjusting promotional strategy for online shopping.
Leverage Data: Use single-source data to track behavior and sentiment, and activate from one data source. This can provide a full, omnichannel understanding of consumers and highlight growth potential.
Targeted Marketing: Since various departments are showing growth, targeted marketing can be employed. For example, marketing for Breakfast items can be directed towards morning routines, while Sexual Wellness Products can have a more discreet, targeted campaign.
Adjust Competitive Strategy: Reassess your competitive brand strategy and market share to include non-FMD spend. Understanding how your brand compares to others in these channels can inform your strategy and help identify opportunities for growth.
Innovation in Product Development: Develop new products or improve existing ones based on the high growth departments. For example, if you are in the health and beauty sector, consider expanding into sexual wellness products.
Remember that the growth strategies should be backed by market research and consumer trends to ensure success.
Response 2 is a comprehensive, multi-faceted tactical response that presents specific actions to take across different areas of product and market strategy.
Response 3 (Azure OpenAI): According to the Numerator Growth in Sight Whitepaper, there is a $101 billion opportunity in non-traditional channels, and 84% of CPG growth occurred outside traditional channels (Food, Mass, and Drug) in 2021. To capitalize on this opportunity, brands and retailers can inspire their innovation, channel, and consumer strategy. The report identifies four opportunities for any company to rally around and take advantage of this unlocked growth: 1) Look at your own category to see how much roadway exists in non-FMD channels before doubling down on future strategy for both innovation and costs, 2) Utilize TruView share to see expanded growth opportunities to determine if focusing growth in non-FMD can offset costly potential initiatives, 3) Adjust innovation pipeline to address both consumer and product needs for non-FMD channels, and educate on brands heavy in non-FMD, and 4) Utilize first-party and single-sourced consumer data to get a consistent and unbiased pulse on the total consumer.
Response 3 takes a broader view, suggesting strategic considerations that a company should integrate into its overall business approach for exploiting growth opportunities in non-FMD channels. Table 1 shows the grading rubric the grader used to assess the responses for the Numerator Growth in Sight Whitepaper. As you can see, the scores agree with our reasoning above. The grader also included notes on why certain grade decisions were made.
Table 1: Grading rubric filled out by CPG Expert. The grader's notes included:
- "Lacked any level of detail and summarized very poorly. Missed many key points from the article"
- "Very narrow focus on what was summarized"
- "Visually looks promising but missed badly on biggest growth opportunities (missed prior 3 pages of growth categories). Not backed by any figures to gauge size of the opportunity"
Generation & Retrieval Quality Metrics
The comparative performance analysis presented in Table 2 illustrates that ChatCPG has a marked advantage over both the CPG Human Expert and Azure OpenAI in several key metrics. Specifically, ChatCPG surpasses the CPG Human Expert by an average margin of 0.75 points in clarity, 1.50 points in detail, 1.13 points in completeness, and 0.88 points in accuracy. The advantage of ChatCPG is even more pronounced when compared to Azure OpenAI, achieving a superior performance by 1.50 points in clarity, 1.88 points in detail, 2.25 points in completeness, and 1.88 points in accuracy. These results underscore ChatCPG’s effectiveness in delivering enhanced clarity, detail, completeness, and accuracy in its responses.
Table 2: Graded quality metrics for ChatCPG, CPG Human Expert, & Azure OpenAI
In addition to pure quality metrics, we also measured the time to complete a task. The time it takes a human or an AI to perform a task gives insight into the technology's transformational capabilities. For the ChatCPG platform, the mean time to solve a task was 77.3 seconds ± 30.4 seconds, with a 95% confidence interval of [16.5, 138.1]. In contrast, the human CPG expert's mean time to solve a task was 1,111.1 seconds ± 139.5 seconds, with a 95% confidence interval of [832.1, 1,390.1]. Thus, on average, the ChatCPG platform was 1,311% faster than the CPG human expert.
Azure OpenAI was slightly faster on average than ChatCPG, with a mean of 38.5 seconds ± 5.21 seconds and a 95% confidence interval of [38.6, 49.0]. Despite Azure OpenAI being faster on average, the ChatCPG platform was quicker on 63% of tasks, because the mean times are skewed by outliers. This is likely because Azure OpenAI uses a vanilla vector search for its RAG systems (with a few simpler augmentations such as HyDE). Vector search is quick with low variance, but as our accuracy measures show, more complex abstractions and architectures are needed for real downstream efficiencies.
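As a sanity check on how such intervals are produced, the helper below computes a mean and a two-sided 95% confidence interval from raw per-task times using a t critical value. The sample times are illustrative placeholders, not the experiment's actual measurements:

```python
import math
import statistics

def mean_ci_95(times: list[float], t_crit: float = 2.447) -> tuple[float, tuple[float, float]]:
    # t_crit = 2.447 is the two-sided 95% t value for 6 degrees of freedom,
    # matching a sample of seven tasks.
    m = statistics.mean(times)
    se = statistics.stdev(times) / math.sqrt(len(times))
    return m, (m - t_crit * se, m + t_crit * se)

# Hypothetical per-task times in seconds, for illustration only.
sample_times = [40.0, 52.0, 61.0, 75.0, 83.0, 96.0, 130.0]
mean, (lo, hi) = mean_ci_95(sample_times)
```

With only seven tasks the interval is wide, which is why per-task win rates (ChatCPG quicker on 63% of tasks) are a useful complement to the means.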
What else is Tickr working on?
In terms of versatility, ChatCPG exhibited a remarkable ability to incorporate the latest research and state-of-the-art technologies into its processes. These include Hypothetical Document Embeddings (HyDE), Forward-Looking Active REtrieval (FLARE), hybrid fusion, sentence windowing, Ensemble Refinement (ER), and much more. These progressive features enhance the platform's capacity to deliver accurate results, even amidst complex analytical tasks, showing it can be a workhorse for knowledge workers.
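Of these, HyDE is the easiest to illustrate: instead of embedding the user's query directly, the system embeds a hypothetical answer drafted by the LLM, because answer-like text tends to sit closer to real answer passages in embedding space. The sketch below stubs the LLM with a fixed function and uses a toy bag-of-words similarity; everything here is an assumption for illustration, not ChatCPG's implementation:

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use a neural model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def fake_llm_draft(query: str) -> str:
    # Stand-in for an LLM writing a hypothetical answer to the query.
    return query + " brands can grow by expanding into club and online channels"

def hyde_retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    # HyDE: embed the hypothetical answer, not the raw query.
    hv = embed(fake_llm_draft(query))
    return sorted(corpus, key=lambda d: cosine(hv, embed(d)), reverse=True)[:k]

corpus = [
    "club and online channels are where brands grow fastest",
    "a glossary of survey terminology",
]
best = hyde_retrieve("how can brands grow?", corpus)
```

The short query alone shares almost no vocabulary with the relevant chunk; the drafted answer does, which is the entire trick behind HyDE.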
The field of generative AI is moving at a rapid pace, and we continue to innovate and conduct research to serve our clients the best RAG systems possible. As mentioned above, one of the most exciting pieces of research Tickr is doing is the unsupervised discovery of optimal RAG architectures. This involves automated, robust data-labeling pipelines and performance testing over several state-of-the-art RAG architectures to serve each of our clients the best solution possible. See our Cold Start & Tuning for Retrieval Augmented Generation, Parts 1 & 2.
The power and potential of RAG with Tickr's ChatCPG platform are evident in the impressive results of our blinded internal experiment. The platform processed and analyzed data an order of magnitude faster than a human expert, all while maintaining high levels of accuracy and detail. The unique features of the system, including adaptive query engines, chain-of-thought reasoning, and LLM rerankers, ensure the quality of results, empowering knowledge workers with a high level of efficiency.
Tickr's ChatCPG platform also sets itself apart through its commitment to incorporating cutting-edge technologies such as FLARE, HyDE, and CPG-tuned embeddings. Put bluntly, we aren't an OpenAI wrapper: each and every one of our clients (or partners) gets a custom-tailored RAG solution optimized for their use cases.
Investing in Tickr's ChatCPG platform is a strategic move towards embracing the future of knowledge-intensive work. Please reach out to email@example.com to better understand how Tickr can help you automate your knowledge-intensive tasks and business workflows.
- Publish Date: September 12th, 2023
- Author: Sam Kahn