A Good Description Is All You Need | by Ilia Teimouri


I started my analysis by obtaining data from HuggingFace. The dataset is called financial-reports-sec (released under the Apache 2.0 license, which permits commercial use), and according to the dataset authors, it contains the annual reports of U.S. public companies filed with the SEC EDGAR system from 1993–2020. Each annual report (10-K filing) is divided into 20 sections.

Two attributes of this data are relevant to the current task:

  • Sentence: Excerpts from the 10-K filing reports
  • Section: Labels denoting the section of the 10-K filing that the sentence belongs to

I have focused on three sections:

  • Business (Item 1): Describes the company’s business, including subsidiaries, markets, recent events, competition, regulations, and labor. Denoted by 0 in the data.
  • Risk Factors (Item 1A): Discusses risks that could impact the company, such as external factors, potential failures, and other disclosures to warn investors. Denoted by 1.
  • Properties (Item 2): Details significant physical property assets. Does not include intellectual or intangible assets. Denoted by 3.

For each label, I sampled 10 examples without replacement. The data is structured as follows:
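The sampling step can be sketched with pandas. This is a minimal sketch on toy data, assuming the rows carry "sentence" and "section" columns as described above (the exact column names in the HuggingFace release may differ):

```python
import pandas as pd

# Toy stand-in for the financial-reports-sec rows: a "sentence" column
# and a numeric "section" label (0 = Item 1, 1 = Item 1A, 3 = Item 2).
df = pd.DataFrame({
    "sentence": [f"example sentence {i}" for i in range(300)],
    "section": [0, 1, 3] * 100,
})

# Draw 10 rows per label; pandas samples without replacement by default.
sampled = (
    df.groupby("section")
      .sample(n=10, random_state=42)
      .reset_index(drop=True)
)
print(sampled["section"].value_counts())
```

The `random_state` pin just makes the draw reproducible across runs.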

Once the data is ready, all I need is a classifier function that takes a sentence from the dataframe and predicts its label.

import openai

Role = '''
You are an expert in SEC 10-K forms.
You will be presented with a text and you need to classify it as 'Item 1', 'Item 1A' or 'Item 2'.
The text belongs to only one of these categories, so return exactly one category.
'''

def sec_classifier(text):
    response = openai.ChatCompletion.create(
        model='gpt-4',
        messages=[
            {"role": "system", "content": Role},
            {"role": "user", "content": text},
        ],
        temperature=0,
        max_tokens=256,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    return response['choices'][0]['message']['content']

I’m using GPT-4 here since it’s OpenAI’s most capable model so far. I’ve also set the temperature to 0 just to make sure the model does not go off track. The really fun part is how I define the Role — that’s where I get to guide the model on what I want it to do. The Role tells it to stay focused and deliver the kind of output I’m looking for. Defining a clear role for the model helps it generate relevant, high-quality responses. The prompt in this function is:

You are an expert in SEC 10-K forms.
You will be presented with a text and you need to classify it as 'Item 1', 'Item 1A' or 'Item 2'.
The text belongs to only one of these categories, so return exactly one category.
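One practical detail worth noting: the model returns free text, so before scoring it can help to map the raw reply onto exactly one of the three labels. A small hypothetical helper (not part of the original code) might look like:

```python
def normalize_label(raw: str) -> str:
    """Map a raw model reply such as "Item 1A." or "'Item 2'" onto one
    of the three expected labels. Hypothetical post-processing step."""
    text = raw.strip().lower()
    # Check 'Item 1A' before 'Item 1', since the latter is a prefix of it.
    for label in ("Item 1A", "Item 1", "Item 2"):
        if label.lower() in text:
            return label
    return "Unknown"
```

With temperature 0 and the "only return one category" instruction, GPT-4 usually answers cleanly, but a guard like this keeps stray punctuation or phrasing from corrupting the evaluation.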

After applying the classification function across all data rows, I generated a classification report to evaluate model performance. The macro average F1 score was 0.62, a moderate but workable result for this multi-class problem. Since the number of examples was balanced across all three classes, the macro and weighted averages converged to the same value. This baseline score reflects the out-of-the-box performance of the pretrained model prior to any additional tuning or optimization.

              precision    recall  f1-score   support

      Item 1       0.47      0.80      0.59        10
     Item 1A       0.80      0.80      0.80        10
      Item 2       1.00      0.30      0.46        10

    accuracy                           0.63        30
   macro avg       0.76      0.63      0.62        30
weighted avg       0.76      0.63      0.62        30
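The report itself can be reproduced with scikit-learn once the predictions are collected. A sketch with toy labels standing in for the dataframe's ground truth and the classifier's outputs:

```python
from sklearn.metrics import classification_report, f1_score

# Toy stand-ins for the true section labels and the model's predictions.
y_true = ["Item 1"] * 10 + ["Item 1A"] * 10 + ["Item 2"] * 10
y_pred = (
    ["Item 1"] * 8 + ["Item 1A"] * 2     # two Item 1 rows misread as 1A
    + ["Item 1A"] * 8 + ["Item 2"] * 2   # two Item 1A rows misread as 2
    + ["Item 2"] * 10
)

print(classification_report(y_true, y_pred))
macro_f1 = f1_score(y_true, y_pred, average="macro")
```

`classification_report` prints exactly the per-class precision/recall/F1 table shown above, and `average="macro"` gives the unweighted mean of the per-class F1 scores.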

As mentioned, few-shot learning is all about helping the model generalise from a few good examples. To that end, I modified my classifier's prompt by describing what Item 1, Item 1A and Item 2 are (based on Wikipedia):

Role_fewshot = '''
You are an expert in SEC 10-K forms.
You will be presented with a text and you need to classify it as 'Item 1', 'Item 1A' or 'Item 2'.
The text belongs to only one of these categories, so return exactly one category.
In your classification take the following definitions into account:

Item 1 (i.e. Business) describes the business of the company: who and what the company does, what subsidiaries it owns, and what markets it operates in.
It may also include recent events, competition, regulations, and labor issues. (Some industries are heavily regulated or have complex labor requirements, which have significant effects on the business.)
Other topics in this section may include special operating costs, seasonal factors, or insurance matters.

Item 1A (i.e. Risk Factors) is the section where the company lays out anything that could go wrong: likely external effects, possible future failures to meet obligations, and other risks disclosed to adequately warn investors and potential investors.

Item 2 (i.e. Properties) is the section that lays out the significant properties, i.e. physical assets, of the company. This only includes physical property, not intellectual or intangible property.

Note: Only state the Item.
'''

def sec_classifier_fewshot(text):
    response = openai.ChatCompletion.create(
        model='gpt-4',
        messages=[
            {"role": "system", "content": Role_fewshot},
            {"role": "user", "content": text},
        ],
        temperature=0,
        max_tokens=256,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    return response['choices'][0]['message']['content']

The prompt now reads:

You are an expert in SEC 10-K forms.
You will be presented with a text and you need to classify it as 'Item 1', 'Item 1A' or 'Item 2'.
The text belongs to only one of these categories, so return exactly one category.
In your classification take the following definitions into account:

Item 1 (i.e. Business) describes the business of the company: who and what the company does, what subsidiaries it owns, and what markets it operates in.
It may also include recent events, competition, regulations, and labor issues. (Some industries are heavily regulated or have complex labor requirements, which have significant effects on the business.)
Other topics in this section may include special operating costs, seasonal factors, or insurance matters.

Item 1A (i.e. Risk Factors) is the section where the company lays out anything that could go wrong: likely external effects, possible future failures to meet obligations, and other risks disclosed to adequately warn investors and potential investors.

Item 2 (i.e. Properties) is the section that lays out the significant properties, i.e. physical assets, of the company. This only includes physical property, not intellectual or intangible property.

If we run this on the texts we get the following performance:

              precision    recall  f1-score   support

      Item 1       0.70      0.70      0.70        10
     Item 1A       0.78      0.70      0.74        10
      Item 2       0.91      1.00      0.95        10

    accuracy                           0.80        30
   macro avg       0.80      0.80      0.80        30
weighted avg       0.80      0.80      0.80        30

The macro average F1 is now 0.80, a 29% relative improvement over the 0.62 baseline, achieved only by providing a good description of each class.
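For reference, the 29% figure is the relative gain in macro F1, not the absolute difference:

```python
# Relative improvement of the description-augmented prompt over the baseline.
baseline_f1 = 0.62
fewshot_f1 = 0.80

relative_gain = (fewshot_f1 - baseline_f1) / baseline_f1
print(f"{relative_gain:.0%}")  # → 29%
```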

Finally, you can see the full dataset below:

In fact, the descriptions I provided give the model concrete instances to learn from. Examples allow the model to infer patterns and features: by looking at multiple examples, the model can start to notice commonalities and differences that characterise the overall concept being learned, helping it form a more robust representation. Furthermore, providing examples acts as a weak form of supervision, guiding the model towards the desired behaviour in lieu of large labeled datasets.

In the few-shot function, the concrete definitions point the model to the types of information and patterns it should attend to. In summary, such descriptions matter for few-shot learning because they give the model anchor points from which to build an initial representation of a novel concept, which it can then refine. Inductive learning from specific instances helps models develop nuanced representations of abstract concepts.

If you’ve enjoyed reading this and want to keep in touch, you can find me on my LinkedIn or via my webpage: iliateimouri.com

Note: All images, unless otherwise noted, are by the author.


