Building LLMs in the Open-Source Community: A Call to Action for Investment Professionals


ChatGPT and other natural language processing (NLP) chatbots have democratized access to powerful large language models (LLMs), delivering tools that facilitate more sophisticated investment techniques and scalability. This is changing how we think about investing and reshaping roles in the investment profession.

I sat down with Brian Pisaneschi, CFA, senior investment data scientist at CFA Institute, to discuss his recent report, which aims to give investment professionals the confidence to start building LLMs in the open-source community.

The report will appeal to portfolio managers and analysts who want to learn more about alternative and unstructured data and how to apply machine learning (ML) techniques to their workflow.


“Staying abreast of technological trends, mastering programming languages for parsing complex datasets, and being keenly aware of the tools that augment our workflow are necessities that will propel the industry forward in an increasingly technical investment domain,” Pisaneschi says.

“Unstructured Data and AI: Fine-Tuning LLMs to Enhance the Investment Process” covers some of the nuances of one area that is rapidly redefining modern investment processes — alternative and unstructured data. Alternative data differ from traditional data — like financial statements — and are often in an unstructured form, such as PDFs or news articles, Pisaneschi explains.

More sophisticated algorithmic methods are required to gain insights from these data, he advises. NLP, the subfield of ML that parses spoken and written language, is particularly suited to dealing with many alternative and unstructured datasets, he adds.
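As a simple illustration of what that means in practice, the snippet below runs a pretrained sentiment model over a news headline using the open-source Hugging Face transformers library. This is a minimal sketch; the headline and the default model choice are illustrative assumptions, not examples from the report.

```python
# A minimal sketch of applying NLP to unstructured text, assuming the
# Hugging Face transformers library is installed. The headline is a
# made-up example, not data from the report.
from transformers import pipeline

# Load a general-purpose, pretrained sentiment-analysis model
classifier = pipeline("sentiment-analysis")

headline = "Company X beats earnings expectations and raises full-year guidance."
print(classifier(headline))  # e.g., [{'label': 'POSITIVE', 'score': 0.99}]
```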

ESG Case Study Demonstrates Value of LLMs

The combination of advances in NLP, an exponential rise in computing power, and a thriving open-source community has fostered the emergence of generative artificial intelligence (GenAI) models. Critically, GenAI, unlike its predecessors, has the capacity to create new data by extrapolating from the data on which it is trained.

In his report, Pisaneschi demonstrates the value of building LLMs by presenting an environmental, social, and governance (ESG) investing case study, showcasing their use in identifying material ESG disclosures from company social media feeds. He believes ESG is an area that is ripe for AI adoption and one for which alternative data can be used to exploit inefficiencies to capture investment returns.

NLP’s increasing prowess and the growing insights being mined from social media data motivated Pisaneschi to conduct the study. He laments, however, that since the study was conducted in 2022, some of the social media data used are no longer free. There is growing recognition of the value of the data that AI companies require to train their models, he explains.

Fine-Tuning LLMs

LLMs have innumerable use cases due to their ability to be customized in a process called fine-tuning. During fine-tuning, users create bespoke solutions that incorporate their own preferences. Pisaneschi explores this process by first outlining the advances of NLP and the creation of frontier models like ChatGPT. He also provides a structure for starting the fine-tuning process.
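As a rough sketch of what starting that process can look like, the example below fine-tunes a small open-source model for a binary text-classification task using the Hugging Face transformers and datasets libraries. The base model, file name, label count, and hyperparameters are illustrative assumptions, not the report’s specification.

```python
# A minimal fine-tuning sketch, assuming a CSV of labeled examples
# ("labeled_disclosures.csv" with a "text" column and an integer 0/1
# "label" column). All names and hyperparameters are illustrative.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "distilbert-base-uncased"  # a small, commonly used base model

# Load the labeled data and split off a validation set
dataset = load_dataset("csv", data_files="labeled_disclosures.csv")["train"]
dataset = dataset.train_test_split(test_size=0.2, seed=42)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize(batch):
    # Convert raw text into token IDs the model can consume
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

args = TrainingArguments(
    output_dir="finetuned-classifier",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)

trainer.train()  # updates the pre-trained weights on the bespoke labeled data
```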

The dynamics of fine-tuning smaller language models versus using frontier LLMs to perform classification tasks have changed since ChatGPT’s launch. “This is because traditional fine-tuning requires significant amounts of human-labeled data, whereas frontier models can perform classification with only a few examples of the labeling task,” Pisaneschi explains.

Traditional fine-tuning on smaller language models can still be more efficacious than using large frontier models when the task requires a significant amount of labeled data to understand the nuance between classifications.
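For contrast, here is a minimal sketch of the few-shot approach: a handful of labeled examples are placed directly in the prompt sent to a frontier model rather than being used to update model weights. It assumes the openai Python SDK (v1+) and an API key in the environment; the model name, labels, and example posts are illustrative assumptions, not the setup used in the report.

```python
# A minimal few-shot classification sketch, assuming the openai SDK (v1+)
# and OPENAI_API_KEY set in the environment. Labels and examples are
# hypothetical, for illustration only.
from openai import OpenAI

client = OpenAI()

FEW_SHOT_EXAMPLES = [
    ("We cut Scope 1 emissions 12% this quarter across all plants.", "material"),
    ("Join us at the community bake sale this weekend!", "not material"),
]

def classify(text: str) -> str:
    # Build a prompt containing only a handful of labeled examples,
    # in place of the large labeled training set fine-tuning would need.
    examples = "\n".join(f"Post: {t}\nLabel: {l}" for t, l in FEW_SHOT_EXAMPLES)
    prompt = (
        "Classify each company social media post as 'material' or 'not material' "
        "from an ESG disclosure standpoint.\n\n"
        f"{examples}\n\nPost: {text}\nLabel:"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any capable chat model works
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(classify("Our audit committee flagged supply-chain labor violations."))
```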

The Power of Social Media Alternative Data

Pisaneschi’s research highlights the power of ML techniques that parse alternative data derived from social media. ESG materiality signals could be more rewarding in small-cap companies because social media disclosures provide closer to real-time information than sustainability reports or investor conference calls do, he points out. “It emphasizes the potential for inefficiencies in ESG data, particularly when applied to a smaller company.”

He adds, “The research showcases the fertile ground for using social media or other real-time public information. But more so, it emphasizes how once we have the data, we can customize our research easily by slicing and dicing the data and looking for patterns or discrepancies in the performance.”

The study looks at differences in materiality by market capitalization, but Pisaneschi says other dimensions could be analyzed, such as differences across industries or an alternative weighting mechanism in the index, to find other patterns.

“Or we could expand the labeling task to include more materiality classes or focus on the nuance of the disclosures. The possibilities are only limited by the creativity of the researcher,” he says. 

CFA Institute Research and Policy Center’s 2023 survey, “Generative AI, Unstructured Data, and Open Source,” is a valuable primer for investment professionals. The survey, which received 1,210 responses, dives into which alternative data investment professionals are using and how they are using GenAI in their workflows.

The survey covers which libraries and programming languages are most valuable for the parts of an investment professional’s workflow that involve unstructured data, and it compiles open-source alternative data resources sourced from survey participants.


The future of the investment profession is strongly rooted in the cross-collaboration of artificial and human intelligence and their complementary cognitive capabilities. The introduction of GenAI may signal a new phase of the AI plus HI (human intelligence) adage.


