To (use) LLM or not to LLM: A Case-Study with Tabular Data
Abstract
The advent of foundation Large Language Models (LLMs) brings forth the promise of a universal prediction engine. In this article, we seek to understand whether that promise is realized. Specifically, we want to understand whether there are classes of prediction problems for which LLMs are not yet competitive with traditional prediction methods.
As the main conclusion, using textbook-style benchmark datasets, we find that traditional approaches perform much better than LLMs on (structured) regression and classification tasks. On the other hand, whenever the prediction task involves unstructured data (e.g., long text features), LLMs outperform traditional approaches.
To utilize an LLM for a specific prediction task, one either tunes the LLM with task-specific data or supplies relevant context drawn from the available data. Tuning an LLM is expensive, if not infeasible, for general-purpose use. To provide relevant context, one needs infrastructure that can surface the relevant context for each prediction query in real time. Again, this is not easy for a generic user to build.
At Ikigai Labs, we have built an easy-to-use (drag-n-drop) interface, the aiLLM, that enables one to use an LLM as a generic prediction engine with spreadsheet-like skills. We utilize aiLLM as well as other prediction methods available out of the box within the Ikigai platform to generate these results.
New models bring new solutions
As LLMs become popular and accessible, they create new ways to solve prediction problems. These foundation models hold promise to become universal predictors. They are trained on internet-scale data and can extract information out of structured as well as unstructured (e.g., text) data. While their ability to provide reasonably good answers to natural language queries is excellent, their ability to continually adapt their responses as the context of the ongoing "conversation" builds is simply remarkable.
Adapting to recent conversation or context has made LLMs excellent candidates for universal prediction engines: the training data serves as the context, and the test or future data as queries. And because LLMs can utilize structured and unstructured data alike, they are ideal candidates in terms of ease of use. The only remaining question is whether they are accurate enough.
To understand this, we study the following simple question:
How does LLM-based prediction compare with traditional machine learning approaches?
Testing LLMs against traditional machine learning approaches
To answer this question, we need to define how to utilize LLMs for generic prediction tasks, decide which traditional approaches to compare against, and choose the benchmark datasets on which to perform the comparison.
To adapt an LLM for prediction, there are two approaches: either tune the LLM for each prediction task, or use the training data to build appropriate context for each prediction query. We take the latter approach, carefully designed by Ikigai Labs through the aiLLM feature. Details are given below.
To compare against aiLLM, we use a traditional Random Forest based approach. For ease of developing benchmarks and comparisons, we use this feature from Ikigai's platform. Details are given below.
We compare the performance of these two competing methods on a collection of benchmark datasets covering both regression and classification tasks. The datasets involve structured as well as unstructured features. We use standard accuracy metrics to evaluate (and compare) performance. Details are given below.
Method 1. aiLLM at Ikigai Labs
At Ikigai Labs, we have built an LLM-based predictor that makes it easy for anyone to use an LLM for prediction tasks of various sorts. It does not require knowing prompt engineering or having infrastructure; instead, it leverages simple drag-n-drop functionality.
Specifically, it uses a combination of nearest neighbors and an LLM to create the aiLLM model. The inputs to aiLLM are two datasets with the same columns: one to train and one to test. The output is a column of values predicted by the LLM.
We tested aiLLM with two different prompts. The first is for zero-shot classification, where no examples are given to the LLM. In this prompt, we only tell the LLM which categories it can classify each row into.
Context: You will be given an array of exactly {batch_number} strings that contain {feature_columns_string}. The goal is to classify them into the following categories: {target_column_values_string} and reply with an array of exactly {batch_number} values, each entry represents the category of the {batch_number} original input strings.
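As a rough illustration, here is a minimal Python sketch of how such a batched zero-shot prompt could be assembled and sent to an LLM. The helper names (build_zero_shot_prompt, call_llm) and the batching details are our own assumptions for illustration; this is not the actual aiLLM implementation.

import json

def build_zero_shot_prompt(rows, feature_columns, target_categories):
    # Assemble a batched zero-shot classification prompt (illustrative only).
    inputs = [" | ".join(str(row[c]) for c in feature_columns) for row in rows]
    return (
        f"Context: You will be given an array of exactly {len(inputs)} strings "
        f"that contain {', '.join(feature_columns)}. The goal is to classify them "
        f"into the following categories: {', '.join(target_categories)} and reply "
        f"with an array of exactly {len(inputs)} values, each entry represents the "
        f"category of the {len(inputs)} original input strings.\n"
        f"Inputs: {json.dumps(inputs)}"
    )

# Hypothetical usage with a generic chat-completion client (call_llm is a stand-in):
# prompt = build_zero_shot_prompt(batch_of_rows, ["review"], ["positive", "negative"])
# predictions = json.loads(call_llm(prompt))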
The second prompt is for predicting output values using the three nearest neighbors to the target row. Nearest neighbors are determined by first creating numeric representations of each row in the training data and then finding which of these representations are closest to the target row's numeric representation. A small sketch of this retrieval step appears after the template below.
Context: Given the following array of exactly {batch_number} entries. Each entry will have exactly three examples of intended outputs, and then an input labeled predict for which you must predict the output. The outputs replace the nan values in the inputs. These are the column headers: {column_headers}. Reply with an array of exactly {batch_number} values, one for each value that has replaced nan in the Predict In string.
Template for nearest neighbors information:
Ex 1:
In: first nearest neighbor feature column values
Out: first nearest neighbor target column value
Ex 2:
In: second nearest neighbor feature column values
Out: second nearest neighbor target column value
Ex 3:
In: third nearest neighbor feature column values
Out: third nearest neighbor target column value
Predict:
In: feature column values of the row we are trying to predict
[LLM should complete the “Predict” part of the prompt]
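As a minimal sketch of the nearest-neighbor retrieval and prompt assembly, the Python below uses scikit-learn's NearestNeighbors on standardized numeric features. The actual numeric representation used by aiLLM (for example, how text columns are encoded) is not described here, and the helper name build_knn_prompt is our own.

from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def build_knn_prompt(train_df, query_row, feature_cols, target_col, k=3):
    # Find the k nearest training rows (numeric features assumed) and format
    # them as few-shot examples following the template above. Illustrative only.
    scaler = StandardScaler().fit(train_df[feature_cols])
    nn = NearestNeighbors(n_neighbors=k).fit(scaler.transform(train_df[feature_cols]))
    _, idx = nn.kneighbors(scaler.transform(query_row[feature_cols].to_frame().T))
    lines = []
    for j, i in enumerate(idx[0], start=1):
        neighbor = train_df.iloc[i]
        lines.append(f"Ex {j}:")
        lines.append("In: " + ", ".join(f"{c}={neighbor[c]}" for c in feature_cols))
        lines.append(f"Out: {neighbor[target_col]}")
    lines.append("Predict:")
    lines.append("In: " + ", ".join(f"{c}={query_row[c]}" for c in feature_cols))
    return "\n".join(lines)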
Method 2. Random Forest
We compared the target column output produced by aiLLM to the same output produced by a random forest model. A random forest aggregates the outputs of several decision trees to form its final predictions, where each decision tree makes a prediction through a series of simple choices (splits) on the feature values.
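For reference, a random forest baseline of this kind can be set up in a few lines with scikit-learn. The sketch below uses synthetic data and default-style hyperparameters purely for illustration; it does not reproduce the exact configuration used in our benchmarks.

import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Toy numeric data standing in for a tabular dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)      # binary target
y_reg = X @ np.array([2.0, -1.0, 0.5, 0.0, 3.0])   # continuous target

# Classification (e.g., Titanic "Survived").
X_tr, X_te, y_tr, y_te = train_test_split(X, y_class, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
class_preds = clf.predict(X_te)

# Regression (e.g., Boston Housing "medv").
X_tr, X_te, y_tr, y_te = train_test_split(X, y_reg, random_state=0)
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
reg_preds = reg.predict(X_te)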
Datasets
We used six different datasets: four to test classification and two to test regression (for Boston Housing, we used two different target columns).
Benchmark
For regression, we compared the R² value, mean squared error (MSE), and mean absolute error (MAE). For classification, we compared the accuracy and the balanced accuracy.
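All of these are standard metrics and can be computed with scikit-learn; the snippet below is a generic sketch of the evaluation, not the exact benchmarking code.

from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             mean_absolute_error, mean_squared_error, r2_score)

def regression_report(y_true, y_pred):
    # R^2, MSE, and MAE for a regression task.
    return {
        "r2": r2_score(y_true, y_pred),
        "mse": mean_squared_error(y_true, y_pred),
        "mae": mean_absolute_error(y_true, y_pred),
    }

def classification_report_simple(y_true, y_pred):
    # Accuracy and balanced accuracy for a classification task.
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
    }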
The six datasets had different compositions of text and numeric data. We sorted them by "Task Type," which has two components. The first is whether the problem is regression or classification. The second is whether the dataset's features are "traditional" or "nontraditional." A dataset is traditional if it contains only numerical and categorical variables, or text that is not informative for the prediction (for example, the Name column of the Titanic dataset). A dataset is nontraditional if it has one or more text (not categorical) columns that are necessary for predicting the value of the target.
Metric data for each dataset is summarized in the tables below.
[Table: Classification metrics by dataset]
[Table: Regression metrics by dataset]
Results
The LLM performed better when the tabular data contained text-heavy information (IMDB Movie Reviews or Tweet Sentiments). Conversely, the LLM performed worse than traditional methods when the data was primarily numeric. The LLM was also better at classification tasks than at regression tasks, because the classification prompt gives the LLM a fixed set of options to predict from (only the categories of the target column).
Below are the takeaways for classification-specific and regression-specific tasks.
Classification
LLM predictions did best with zero-shot classification when the features contained long text descriptions.
We tested three datasets with aiLLM zero-shot classification (random forest is not zero-shot). aiLLM performed best on IMDB movie reviews: the dataset contains a "review" column with several paragraphs of review text, which gives the LLM enough information to classify 93% of the data correctly. Further, the potential output categories are distinct: positive or negative. In the cases where the LLM did not classify correctly, the reviews often contained mixed signals that could have confused the model; for example, one review that was negative overall was classified as positive.
This same type of error happens much more often when the target column labels are less distinct. For example, on the Short Tweet Sentiments dataset, the LLM performed much worse than on the IMDB dataset (69.3% accuracy). Often, the model mixed up positive with neutral and negative with neutral (less often positive with negative). The model may also have performed worse because a short tweet does not provide enough context to be sure about the sentiment (as opposed to the long descriptions in the IMDB movie reviews).
The Titanic dataset posed a different problem for aiLLM: it is a traditional classification problem, since its non-categorical text columns (such as Name and Ticket) provide no information about the outcome.
The only non-numeric column that helps predict the Survived column is the Sex column. This allows for predictions with 71.3% accuracy because the categories are distinct (survived or did not survive). However, the random forest predicted the outcome with 78% accuracy because it was able to use the numerical columns to formulate better predictions.
There are also instances of traditional classification where zero-shot classification is not possible, because the names and values of the feature columns do not explicitly reflect how they are related to the target column. For example, in the Default dataset, the target column (whether someone defaults on financial obligations) is predicted by finding a pattern between the balance to be paid off and a person's income, so the LLM needs information beyond the potential target column categories (Yes and No).
In this case, we used the second prompt, which adds nearest neighbors to the context given to the LLM. The key metrics for the Default dataset follow.
The LLM performed with 91.3% accuracy, whereas a model that simply filled every value with "No" would have achieved 96.7% accuracy. The extra "Yes" values and mislabeled "No" values resulted from the nearest neighbors being passed into the LLM: if the Student column was "Yes" (which happens 30% of the time), the LLM's prediction was influenced by this value and could also have been "Yes." Because the LLM is more influenced by the categorical values, it was not able to produce accurate predictions for the Default dataset. The random forest performed with 97.5% accuracy, demonstrating that for target values predicted mainly by numeric columns, it is preferable to use a random forest model.
Regression
The LLM model performs consistently worse than the random forest model on regression across all the datasets (Boston Housing is traditional regression, while NYC Rental Housing Prices is nontraditional regression because it also contains a description of the house in one of its columns).
See the metrics for Boston Housing (target column: median value).
For Boston Housing specifically, none of the column names or numbers provided the LLM with context for understanding the data, and thus it did worse than a random forest model, which excels at recognizing numerical patterns.
See the metrics for Boston Housing with target column "age" below. aiLLM performed poorly compared to the random forest because, even though "age" might be a more descriptive column name than "medv," it is unclear to the LLM what "age" refers to: it could have answered plausible human ages when the intended values were ages of houses. When a dataset is traditional, it is important that the LLM understand what type of data it is predicting, which is not possible with "medv" and "age" in Boston Housing.
The aiLLM model's poor performance on traditional regression problems can be explained by its inconsistency with numbers. We asked the LLM to explain how it was choosing values for the Boston Housing dataset, and it focused only on the three target column values of the nearest neighbors, often explaining that it was taking either the mean or the median of those numbers. However, sometimes when the LLM explained that it took the mean, the mean was calculated incorrectly.
For nontraditional regression, these same problems still occurred. For the NYC Rental Housing Prices dataset, the description of the house was not enough to predict the price, and thus aiLLM guessed numbers that were similar to the target values of the nearest neighbors. Because LLMs are not built for numerical pattern recognition, while random forests depend on numerical pattern analysis for their predictions, regression tasks are better suited to a random forest model.
Time Comparison
In general, the random forest models are much faster at delivering results.
We ran aiLLM and the random forest model on different numbers of test rows (10, 50, 100, 150, 200) for one dataset each. For regression, we used Boston Housing, and for classification, we used the IMDB review dataset.
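A simple way to reproduce this kind of comparison is to time each predictor on increasing slices of the test set, as in the sketch below. The predict_with_aillm call is a hypothetical stand-in for whatever LLM-based predictor is being measured.

import time

def time_predictor(predict_fn, row_counts=(10, 50, 100, 150, 200)):
    # Time predict_fn on increasing numbers of test rows (illustrative harness).
    timings = {}
    for n in row_counts:
        start = time.perf_counter()
        predict_fn(n)
        timings[n] = time.perf_counter() - start
    return timings

# Hypothetical usage:
# rf_times = time_predictor(lambda n: clf.predict(X_test[:n]))
# llm_times = time_predictor(lambda n: predict_with_aillm(test_df.head(n)))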
The part of aiLLM that consumes the most time is token generation: the more text aiLLM generates, the longer it takes. For regression and classification, this puts it on the faster side, because it only needs to generate one number or word per row. Even so, it is still much slower than the random forest model at producing predictions.
Conclusion
Understanding when to use LLMs is necessary to leverage their capabilities as universal predictors. In its current form, the only case where aiLLM is a better fit for tabular data is classification with long-form text (like IMDB Reviews and Short Tweet Sentiments). For all the other datasets, aiLLM performed poorly compared to the traditional random forest method.
About the Authors
Arth Dharaskar is a Machine Learning team lead at Ikigai. He holds a Bachelor's degree in Computer Science from Rutgers and a Master's degree in Electrical and Computer Engineering, specializing in ML/DS, from the University of California, San Diego. He enjoys building scalable ML systems.
Eliza Knapp is a sophomore studying Applied Mathematics with Computer Science at Harvard University. She currently works at Ikigai as a Machine Learning intern.
Acknowledgements
We would like to thank Katie Lenahan, Nate Lanier, Parvathi Narayan and Devavrat Shah for their help and feedback while writing this blog post.