As Large Language Models (LLMs) become more capable, they are increasingly used for information extraction from business documents such as legal contracts, invoices, financial reports, and resumes, to name a few. The information extracted from multiple sources can then be used for matchmaking and recommendation systems.
Some of the applications of information extraction and matchmaking include:
- Automatic request-for-quotation (RFQ) generation by extracting information from customers’ requests
- Extracting key usage patterns from customers’ data to provide product recommendations
- Extracting key information from tenders and matching it with company profiles to find potential bidders
- Extracting key information from a company’s invoices and sales documents to generate sale-purchase reports
- Extracting key information from purchase orders to facilitate inventory and supply chain management
- Matching individuals on dating or matrimonial platforms based on their profiles, etc.
During my AI consultancy experience in the FAIR EDIH project in Finland, I encountered several matchmaking use cases that could be implemented by employing LLMs for information extraction and subsequently providing recommendations aligned with the extracted information. Some of these use cases include:
- Matching users’ preferences for buying cars and houses
- Mapping student skills with career pathways in a learning management system
- Matching EU regulations and compliances with tender proposals
- Recommending experts for reviewing a research proposal
- Providing peer recommendations to students for enhancing the learning experience
- Upskilling recommendations for a company’s employees based on their profiles
- Relocating applicants to workplaces that match their profiles, to name a few.
In this article, I will discuss a use case of extracting key information from a job seeker’s Curriculum Vitae (CV) or resume and recommending jobs aligned with the job seeker’s profile from a job database. The method applies to both CVs and resumes; however, I will only use the term "CV" throughout the article. This use case is very useful for job search platforms that want to integrate AI into their existing system. Such platforms maintain a job database and allow users to create profiles and/or upload their CVs. The same method can also be applied to help recruiters find potential candidates who match the job ads.
In this article, we will develop an application with a simple GUI that analyzes an uploaded CV to extract a profile comprising educational credentials, skills, and professional experience, and subsequently recommends the top jobs matching that profile, along with an explanation for each selection.
It is important to note that this example use case can be extended for several other information extraction and matchmaking tasks.
This article will cover the following topics:
- Utilizing LlamaParse and Pydantic models to extract structured information from documents using an LLM.
- Applying this information extraction method to CVs to extract educational credentials, skills, and professional experience.
- Scoring the extracted skills based on their strength (semantic score) in the CV.
- Creating a job vector database from a curated list of job ads.
- Retrieving top matching jobs from the vector database based on their semantic similarity with the extracted profile.
- Generating the final job recommendations with an LLM, including an explanation for each recommendation.
- Developing a simple Streamlit application that allows the selection of multiple LLMs and embedding models (both OpenAI and open-source).
The whole code can be found in my GitHub repository with complete instructions.
There are two main folders in the repository: i) the code in the folder OpenAI models uses OpenAI’s gpt-4o LLM and the text-embedding-3-large embedding model, and ii) the code in the folder Multiple models offers the option to select OpenAI as well as open-source LLMs (e.g., gpt-4o, gpt-4o-mini, llama3:70b-instruct-q4_0, mistral:latest, llama3.3:latest) and embedding models (e.g., text-embedding-3-large, text-embedding-3-small, BAAI/bge-small-en-v1.5).
You will need an OpenAI API key to run the code in the [OpenAI models](https://github.com/umairalipathan1980/CV-Analyzer-Job-Recommender/tree/main/OpenAI%20models) folder. However, if you have a powerful PC with a CUDA-enabled GPU, you can test the code in the Multiple models folder with open-source models for free. You can run this code even without a CUDA-enabled GPU, but the processing will be very slow. Both versions of the code can easily be extended with more LLMs and/or embedding models for experimentation. For the sake of simplicity, I will only refer to the code in OpenAI models in this article.
The following figure shows the overall process.

The following is a snapshot of the Streamlit application.

Parsing with LlamaParse and Information Extraction & Validation with Pydantic Models
In the following article, I demonstrated information extraction from unstructured documents using LLMs. There, I used the python-docx library to extract text from AI consultancy documents (MS Word) and sent the text of each document directly to an LLM for information extraction.
LLM-Powered Parsing and Analysis of Semi-Structured & Structured Documents
In another article, I demonstrated a better parsing method using LlamaParse for contextual, multimodal Retrieval-Augmented Generation (RAG). LlamaParse is a genAI-based document parsing platform that parses and cleans data, ensuring it is of good quality and in a proper format before passing it to an LLM. Please see the above-mentioned article to set up LlamaParse and get its free API key.
In this article, I will use LlamaParse to parse data from a CV. However, instead of directly extracting the required information from the parsed content using an LLM, I will use Pydantic models to enforce a specific schema for information extraction and validate the extracted information against the given schema. This process ensures that the output generated by an LLM conforms to the expected types and formats. Pydantic validation also helps to reduce LLM hallucinations.
Pydantic offers a clean and concise way to define data models using Python classes. Before discussing the Pydantic-guided information extraction from CVs, I will first demonstrate the process on a generic document. I will use the same example AI consultancy document as in the above-mentioned article and extract key, structured information from it. Here is the example document.
This is the AI consultancy of the company Sagittarius Tech on the date 2024-09-12. This was a regular session facilitated by the expert Klaus Muller. Sagittarius Tech, based in Finland, is a forward-thinking, well-established company specializing in renewable energy solutions. They have a strong technical foundation in renewable energy systems, particularly in solar and wind energy, but their application of AI technology is still in its infancy, leading to a current AI maturity level that is considered low.
The company's objectives are well articulated and focus on optimizing the efficiency of their energy distribution networks. Specifically, Sagittarius Tech aims to implement AI-driven predictive maintenance for their solar farms and wind turbines. Their current approach to maintenance is largely reactive, with inspections carried out at regular intervals or when a failure is detected. This method, while functional, is neither cost-effective nor efficient, as it often leads to unexpected downtime and higher maintenance costs. By integrating AI into their maintenance operations, Sagittarius Tech hopes to predict and prevent equipment failures before they occur, thereby reducing downtime and extending the lifespan of their energy assets.
The idea of implementing predictive maintenance using AI is highly relevant and aligns with current industry trends. By predicting equipment failures before they happen, Sagittarius Tech can improve the reliability of their energy systems and offer more consistent service to their clients. The application of AI for this purpose is particularly advantageous, as it allows for the analysis of large datasets from sensors and monitoring equipment to identify patterns and anomalies that might indicate impending failures.
While the company's immediate goals are clear, their long-term strategy for AI integration is still under consideration. However, they have identified their target market as large-scale renewable energy operators and utility companies. In terms of data requirements, Sagittarius Tech has access to extensive datasets generated by the sensors installed on their solar panels and wind turbines. This data, which includes temperature readings, vibration analysis, and energy output metrics, is crucial for training and validating AI models for predictive maintenance. The data is continuously updated as part of their ongoing operations, providing a rich source of information for AI-driven insights.
The company has demonstrated strong technical expertise in renewable energy systems and in managing the associated data. They have a growing interest in AI, particularly in the area of predictive analytics, though their experience in this field is still developing. Sagittarius Tech is seeking technical assistance from FAIR Services to develop an AI proof-of-concept (POC) focused on predictive maintenance for their energy assets. During the consultation, it was noted that the company could benefit from targeted training in AI-based predictive maintenance techniques to further their capabilities.
The experts suggested that the challenge of implementing predictive maintenance could be approached through the use of machine learning models that are specifically designed to handle time-series data. Models such as LSTM (Long Short-Term Memory) networks, which are particularly effective in analyzing sequential data, can be applied to the sensor data collected by Sagittarius Tech. These models are capable of learning patterns over time and can provide early warnings of potential equipment failures. However, the experts noted that these models require a significant amount of data for training, so it may be beneficial to begin with a smaller pilot project before scaling up.
The experts further recommended exploring the integration of AI-driven predictive maintenance tools with the company's existing monitoring systems. This integration can be achieved through the use of custom APIs and middleware, allowing the AI models to continuously analyze incoming data and provide real-time alerts to the maintenance team. Additionally, the experts emphasized the importance of a hybrid approach, combining AI predictions with human expertise to ensure that maintenance decisions are both data-driven and informed by practical experience.
Starting with pre-trained models for time-series analysis was recommended, with the option to fine-tune these models based on the specific characteristics of Sagittarius Tech's equipment and operations. It was advised to avoid training models from scratch due to the computational complexity and resource requirements involved. Instead, a phased approach to AI integration was suggested, where the predictive maintenance system is gradually rolled out across different sites, allowing the models to be refined and validated in a controlled environment. This approach ensures that the AI system can be effectively integrated into the company's operations without disrupting existing processes.
We have hundreds of such unstructured documents and the aim is to extract the following key information: company name, country, consultation date, experts, consultation type, area domain, current solution, AI field, AI maturity level, technical expertise and capability, company type, aim, identified target market, data requirement assessment, FAIR’s services sought, and recommendations.
The following libraries must be installed before running the given code.
pip install openai pydantic[email] llama_parse llama-index python-dotenv streamlit
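The code below loads API keys from a .env file via python-dotenv. As a quick sanity check before running anything, you can verify that the keys are visible (a minimal sketch; the key names match the os.getenv calls used in the code, and the values are placeholders you obtain from OpenAI and LlamaCloud):
# Assumed .env contents (placeholders, not real keys):
#   OPENAI_API_KEY=sk-...
#   LLAMA_API_KEY=llx-...
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the working directory
for key in ("OPENAI_API_KEY", "LLAMA_API_KEY"):
    if not os.getenv(key):
        raise RuntimeError(f"{key} is missing from the environment/.env file")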
The following code defines a Pydantic model to enforce a particular schema for data extraction, validate the LLM’s output, and convert some fields into the expected format.
import os
import json
import openai
from datetime import datetime, date
from typing import List, Optional
from pydantic import BaseModel, Field, field_validator
from llama_parse import LlamaParse
from llama_index.llms.openai import OpenAI
from dotenv import load_dotenv
from llama_index.core import SimpleDirectoryReader
load_dotenv() #load the API keys from .env file
class AIconsultation(BaseModel):
company_name: Optional[str] = Field(None, description="The name of the company seeking AI advisory")
country: Optional[str] = Field(None, description="The company's country")
consultation_date: Optional[str] = Field(None, description="The date of consultation")
experts: Optional[List[str]] = Field(None, description="The experts providing AI consultancy")
consultation_type: Optional[str] = Field(None, description="Type of consultation: Regular or pop-up")
area_domain: Optional[str] = Field(None, description="The field of the company's operations (e.g., healthcare, logistics, etc.)")
current_solution: Optional[str] = Field(None, description="A brief summary of the current solution (e.g., Recommendation System, professional guidance system)")
ai_field: Optional[List[str]] = Field(None, description="AI sub-fields in use or required (e.g., computer vision, generative AI)")
ai_maturity_level: Optional[str] = Field(None, description="AI maturity level: low, moderate, high")
technical_expertise_and_capability: Optional[str] = Field(None, description="Company's technical expertise: low, moderate, or high")
company_type: Optional[str] = Field(None, description="Company type: startup or established company")
aim: Optional[str] = Field(None, description="Main AI task the company is aiming for")
identified_target_market: Optional[str] = Field(None, description="The targeted customers (e.g., healthcare professionals, construction firms)")
data_requirement_assessment: Optional[str] = Field(None, description="Type of data required for AI integration with format/modality")
fair_services_sought: Optional[str] = Field(None, description="Services expected from FAIR (e.g., technical advice, proof of concept)")
recommendations: Optional[str] = Field(None, description="Key recommendations focusing on most important suggested actions")
@field_validator("consultation_date", mode="before")
def validate_and_convert_date(cls, raw_date):
if raw_date is None:
return None
if isinstance(raw_date, str):
# List of acceptable date formats
date_formats = ['%d-%m-%Y', '%Y-%m-%d', '%d/%m/%Y', '%m-%d-%Y']
for fmt in date_formats:
try:
# Attempt to parse the date string with the current format
parsed_date = datetime.strptime(raw_date, fmt).date()
# Return the date in MM-DD-YYYY format as a string
return parsed_date.strftime('%m-%d-%Y')
except ValueError:
continue # Try the next format
# If none of the formats match, raise an error
raise ValueError(
f"Invalid date format for 'consultation_date'. Expected one of: {', '.join(date_formats)}."
)
if isinstance(raw_date, date):
# Convert date object to MM-DD-YYYY format
return raw_date.strftime('%m-%d-%Y')
raise ValueError(
"Invalid type for 'consultation_date'. Must be a string or a date object."
)
def extract_content(file_path):
"""Parse the document and extract its content as text."""
#Initialize LlamaParse parser
parser = LlamaParse(
result_type="markdown",
parsing_instructions="Extract each section separately based on the document structure.",
auto_mode=True,
api_key=os.getenv("LLAMA_API_KEY"),
verbose=True
)
file_extractor = {".pdf": parser}
# Load the document
documents = SimpleDirectoryReader(
input_files=[file_path], file_extractor=file_extractor
).load_data()
text_content = "n".join([doc.text for doc in documents])
return text_content
def extract_information(document_text, llm_model):
"""Extract structured information and validate with Pydantic schema."""
openai.api_key = os.getenv("OPENAI_API_KEY")
llm = OpenAI(model=llm_model, temperature=0.0)
prompt = f"""
You are an expert in analyzing consultation documents. Use the following JSON schema to extract relevant information:
```json
{AIconsultation.schema_json(indent=2)}
```
Extract the information from the following document and provide a structured JSON response strictly adhering to the schema above.
Please remove any ```json ``` characters from the output. Do not make up any information. If a field cannot be extracted, mark it as `n/a`.
Document:
----------------
{document_text}
----------------
"""
response = llm.complete(prompt)
if not response or not response.text:
raise ValueError("Failed to get a response from LLM.")
try:
parsed_data = json.loads(response.text) # Parse the response text to a Python dictionary
return AIconsultation.model_validate(parsed_data) # Validate the parsed data against the schema
except Exception as e:
raise ValueError(f"Validation failed: {e}")
if __name__ == "__main__":
# Path to the document to analyze
document_path = "Sagittarius.pdf"
if not os.path.exists(document_path):
raise FileNotFoundError(f"The file {document_path} does not exist.")
try:
print("Extracting content from the document...")
document_content = extract_content(document_path)
print("Parsing and extracting structured information...")
consultation_info = extract_information(document_content, llm_model="gpt-4o")
print("Extraction complete. Here is the structured information:")
print(json.dumps(consultation_info.dict(), indent=2))
except Exception as e:
print(f"An error occurred: {e}")
The description of all fields in the `AIconsultation` class is self-explanatory. The field validator function `validate_and_convert_date` checks the format of the extracted `consultation_date` field and, if required, converts it into MM-DD-YYYY format, as shown in the quick example below.
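A minimal check of the validator in isolation (assuming the `AIconsultation` class defined above is in scope):
# The validator accepts several input formats and normalizes them to MM-DD-YYYY.
record = AIconsultation(consultation_date="2024-09-12")
print(record.consultation_date)   # 09-12-2024
record = AIconsultation(consultation_date="12/09/2024")  # day/month/year
print(record.consultation_date)   # 09-12-2024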
The function `extract_content()` parses the given AI consultancy document using LlamaParse, and the function `extract_information()` extracts the required information from the document using the `gpt-4o` LLM, guided by the Pydantic model. The `prompt` in the `extract_information()` function instructs the model to follow the Pydantic schema and output the response in JSON format.
LlamaParse splits the document into multiple sub-documents based on the overall context. As per the instructions given to the parser (see `parsing_instructions` in the `extract_content()` function), the parser creates multiple sections and assigns each section a heading. The parser’s output (the `documents` object in the `extract_content()` function) contains sub-document IDs, the metadata of each sub-document, and the text containing multiple sections with headings.

`documents` object in the `extract_content()` function (image by author)

I only select the text (`text_content` in the `extract_content()` function) for information extraction by the LLM. Here is the final output of the `extract_content()` function. The document has been split into multiple sections, with each section assigned a heading.

Finally, the `extract_information()` function extracts the required information (defined in the Pydantic model) from the parsed content in a nicely structured format. The consultation date was validated and converted into MM-DD-YYYY format. Note that in `prompt`, we do not need to specify what information we want to extract, as this has already been specified in the Pydantic model.
Extracting content from the document...
Started parsing the file under job_id 0761bfee-922a-49a8-9e92-da1877aeea1a
Parsing and extracting structured information...
Extraction complete. Here is the structured information:
{
"company_name": "Sagittarius Tech",
"country": "Finland",
"consultation_date": "09-12-2024",
"experts": [
"Klaus Muller"
],
"consultation_type": "Regular",
"area_domain": "Renewable energy",
"current_solution": "Reactive maintenance for solar farms and wind turbines",
"ai_field": [
"Predictive maintenance",
"Machine learning",
"Time-series analysis"
],
"ai_maturity_level": "Low",
"technical_expertise_and_capability": "High",
"company_type": "Established company",
"aim": "Implement AI-driven predictive maintenance for solar farms and wind turbines",
"identified_target_market": "Large-scale renewable energy operators and utility companies",
"data_requirement_assessment": "Extensive datasets from sensors on solar panels and wind turbines, including temperature readings, vibration analysis, and energy output metrics",
"fair_services_sought": "Technical advice, proof of concept for predictive maintenance",
"recommendations": "Use machine learning models like LSTM for time-series data, start with pre-trained models, integrate AI with existing systems using custom APIs, and adopt a phased approach to AI integration"
}
Parsing CV Content and Information Extraction & Validation Using Pydantic Models
Having demonstrated document parsing with LlamaParse and information extraction via Pydantic models, let’s now discuss parsing CVs, extracting key information from them using Pydantic models, and providing job recommendations aligned with the extracted profile. Here, the extracted educational credentials, skills, and past experience are considered sufficient information to provide matching job recommendations.
The GitHub code is structured into two .py files:
- `CV_analyzer.py`: defines the Pydantic models, configures the LLM and the embedding model, parses the CV’s data, extracts the required information from the CV, assigns scores to the extracted skills, and retrieves matching jobs from the job vector database.
- `job_recommender.py`: initializes a Streamlit application, calls the functions in `CV_analyzer.py` sequentially, and displays the extracted information and job recommendations.
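With these two files in place, the app is launched with Streamlit’s CLI (run from the folder containing `job_recommender.py` and `sample_jobs.json`):
streamlit run job_recommender.py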
The overall function of the code is depicted in the following image.

`CvAnalyzer` and `RAGStringQueryEngine` for CV parsing, Pydantic-guided profile extraction with an LLM, skill scoring, and job recommendation with Streamlit output (image by author)

A few structures in this code have been adapted from this source with significant enhancements. Let’s discuss all the classes and functions in the code one by one.
The following code in `CV_analyzer.py` shows the definitions of the Pydantic models.
# Pydantic model for extracting education
class Education(BaseModel):
institution: Optional[str] = Field(None, description="The name of the educational institution")
degree: Optional[str] = Field(None, description="The degree or qualification earned")
graduation_date: Optional[str] = Field(None, description="The graduation date (e.g., 'YYYY-MM')")
details: Optional[List[str]] = Field(
None, description="Additional details about the education (e.g., coursework, achievements)"
)
@field_validator('details', mode='before')
def validate_details(cls, v):
if isinstance(v, str) and v.lower() == 'n/a':
return []
elif not isinstance(v, list):
return []
return v
# Pydantic model for extracting experience
class Experience(BaseModel):
company: Optional[str] = Field(None, description="The name of the company or organization")
location: Optional[str] = Field(None, description="The location of the company or organization")
role: Optional[str] = Field(None, description="The role or job title held by the candidate")
start_date: Optional[str] = Field(None, description="The start date of the job (e.g., 'YYYY-MM')")
end_date: Optional[str] = Field(None, description="The end date of the job or 'Present' if ongoing (e.g., 'MM-YYYY')")
responsibilities: Optional[List[str]] = Field(
None, description="A list of responsibilities and tasks handled during the job"
)
@field_validator('responsibilities', mode='before')
def validate_responsibilities(cls, v):
if isinstance(v, str) and v.lower() == 'n/a':
return []
elif not isinstance(v, list):
return []
return v
# Main Pydantic class encapsulating the Education and Experience classes along with other information
class ApplicantProfile(BaseModel):
name: Optional[str] = Field(None, description="The full name of the candidate")
email: Optional[EmailStr] = Field(None, description="The email of the candidate")
age: Optional[int] = Field(
None,
description="The age of the candidate."
)
skills: Optional[List[str]] = Field(
None, description="A list of high-level skills possessed by the candidate."
)
experience: Optional[List[Experience]] = Field(
None, description="A list of experiences detailing previous jobs, roles, and responsibilities"
)
education: Optional[List[Education]] = Field(
None, description="A list of educational qualifications of the candidate including degrees, institutions studied in, and dates of start and end."
)
@root_validator(pre=True)
def handle_invalid_values(cls, values):
for key, value in values.items():
if isinstance(value, str) and value.lower() in {'n/a', 'none', ''}:
values[key] = None
return values
The `Education` class in the Pydantic model defines the extraction of educational details, including the institution’s name, the degree or qualification earned, the graduation date, and additional details like coursework or achievements. The `Experience` class defines the extraction of professional experience details, including company name, location, role, start and end dates, and a list of responsibilities or tasks. The main class `ApplicantProfile` encapsulates the `Education` and `Experience` classes, along with other candidate-specific information like name, email, age, and skills. The field validators in each class handle the conversion of invalid or irrelevant values (such as 'n/a' or 'none') or improperly formatted inputs into a consistent data format.
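As a small illustration of how these validators behave (a sketch that assumes the model classes above are already defined; the data is made up):
# 'n/a' strings are normalized by the validators instead of raising validation errors.
profile = ApplicantProfile(
    name="Jane Doe",
    email="n/a",   # root validator maps 'n/a' to None
    skills=["Python", "SQL"],
    education=[{"institution": "Aalto University", "degree": "MSc", "details": "n/a"}],
)
print(profile.email)                 # None
print(profile.education[0].details)  # [] (field validator maps 'n/a' to an empty list)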
After defining the Pydantic models, `CV_analyzer.py` uses a `CvAnalyzer` class with the following structure to perform various tasks.
# Class for analyzing the CV contents
class CvAnalyzer:
def __init__(self, file_path, llm_option, embedding_option):
"""
Initializes the CvAnalyzer with the given resume file path and model options.
Parameters:
- file_path: Path to the resume file.
- llm_option: Name of the LLM to use.
- embedding_option: Name of the embedding model to use.
"""
pass
def _model_settings(self):
"""
Configures the large language model and embedding model based on the user-provided options.
This ensures that the selected models are properly initialized and ready for use.
"""
pass
def extract_profile_info(self) -> ApplicantProfile:
"""
Extracts structured information from the resume and converts it into an ApplicantProfile object.
This includes parsing education, skills, and experience using a selected LLM.
"""
pass
def _get_embedding(self, texts: List[str], model: str) -> torch.Tensor:
"""
Generates embeddings for a list of text inputs using the specified embedding model.
This function is called by compute_skill_scores() function
Parameters:
- texts: List of strings to embed.
- model: Name of the embedding model.
Returns:
- Tensor of embeddings.
"""
pass
def compute_skill_scores(self, skills: list[str]) -> dict:
"""
Computes semantic similarity scores between skills and the resume content.
Parameters:
- skills: List of skills to evaluate.
Returns:
- A dictionary mapping each skill to its similarity score.
"""
pass
def _extract_resume_content(self) -> str:
"""
Called by compute_skill_scores(), this function extracts and returns the raw textual content of the resume.
"""
pass
def _cosine_similarity(self, vec1: torch.Tensor, vec2: torch.Tensor) -> float:
"""
Called by compute_skill_scores() function, calculates the cosine similarity between two vectors.
Parameters:
- vec1: First vector.
- vec2: Second vector.
Returns:
- Cosine similarity score as a float.
"""
pass
def create_or_load_job_index(self, json_file: str, index_folder: str = "job_index_storage"):
"""
Creates a new job vector index from a JSON dataset or loads an existing index from storage.
Parameters:
- json_file: Path to the job dataset JSON file.
- index_folder: Folder to save or load the vector index.
Returns:
- VectorStoreIndex object for querying jobs.
"""
pass
def query_jobs(self, education, skills, experience, index, top_k=3):
"""
Queries the job vector index to find the top-k matching jobs based on the provided profile.
Parameters:
- education: List of educational qualifications.
- skills: List of skills.
- experience: List of work experiences.
- index: Job vector database index.
- top_k: Number of top matching jobs to retrieve (default: 3).
Returns:
- List of job matches.
"""
pass
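The bodies of `__init__` and `_model_settings` are omitted in the skeleton above; a minimal sketch of what they might look like for the OpenAI-only variant is shown below (hypothetical, for illustration only; the repository version also handles open-source models):
# Hypothetical sketch of the two omitted methods, assuming OpenAI models only.
import os
import torch
import openai
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

class CvAnalyzer:  # only the constructor and model setup are sketched here
    def __init__(self, file_path, llm_option, embedding_option):
        self.file_path = file_path
        self.llm_option = llm_option
        self.embedding_option = embedding_option
        self._resume_content = None  # filled later by extract_profile_info()
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self._model_settings()

    def _model_settings(self):
        openai.api_key = os.getenv("OPENAI_API_KEY")
        self.llm = OpenAI(model=self.llm_option, temperature=0.0)
        self.embedding_model = OpenAIEmbedding(model=self.embedding_option)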
The `extract_profile_info` function first parses the given CV with LlamaParse and splits it into sections (as demonstrated in the example presented at the beginning of the article). It then sends the CV’s contents (`self._resume_content`) to the LLM with the Pydantic schema and information extraction instructions (see `prompt`). The response from the LLM (`response`) is validated against the Pydantic schema.
It is worth mentioning that instead of the original parsed content (`documents`), which contains metadata and other information, I extract only the text (`self._resume_content`) and send it to the LLM for information extraction. This prevents the LLM from becoming confused by information scattered across different nodes, which could result in omitting parts of the required information.
def extract_profile_info(self) -> ApplicantProfile:
"""
Extracts candidate data from the resume.
"""
print(f"Extracting CV data. LLM: {self.llm_option}")
output_schema = ApplicantProfile.model_json_schema()
parser = LlamaParse(
result_type="markdown",
parsing_instructions="Extract each section separately based on the document structure.",
auto_mode=True,
api_key=os.getenv("LLAMA_API_KEY"),
verbose=True
)
file_extractor = {".pdf": parser}
# Load resume and parse
documents = SimpleDirectoryReader(
input_files=[self.file_path], file_extractor=file_extractor
).load_data()
# Split into sections
self._resume_content = "n".join([doc.text for doc in documents])
prompt = f"""
You are an expert in analyzing resumes. Use the following JSON schema to extract relevant information:
```json
{output_schema}
```
Extract the information from the following document and provide a structured JSON response strictly adhering to the schema above.
Please remove any ```json ``` characters from the output. Do not make up any information. If a field cannot be extracted, mark it as `n/a`.
Document:
----------------
{self._resume_content}
----------------
"""
try:
response = self.llm.complete(prompt)
if not response or not response.text:
raise ValueError("Failed to get a response from LLM.")
parsed_data = json.loads(response.text)
return ApplicantProfile.model_validate(parsed_data)
except Exception as e:
print(f"Error parsing response: {str(e)}")
raise ValueError("Failed to extract insights. Please ensure the resume and query engine are properly configured.")
The `compute_skill_scores` function computes the embedding of each extracted skill and that of the CV contents. It then computes a cosine similarity score between the skill and CV embeddings. The more prominent a skill is in the CV, the higher its cosine similarity score. Each skill’s cosine similarity score is later normalized between 0 and 5 to display it in a 5-star format.
def compute_skill_scores(self, skills: list[str]) -> dict:
"""
Compute semantic weightage scores for each skill based on the resume content
Parameters:
- skills (list of str): A list of skills to evaluate.
Returns:
- dict: A dictionary mapping each skill to a score
"""
# Extract resume content and compute its embedding
resume_content = self._extract_resume_content()
# Compute embeddings for all skills at once
skill_embeddings = self._get_embedding(skills, model=self.embedding_model.model_name)
# Compute raw similarity scores and semantic frequency for each skill
raw_scores = {}
for skill, skill_embedding in zip(skills, skill_embeddings):
# Compute semantic similarity with the entire resume
similarity = self._cosine_similarity(
self._get_embedding([resume_content], model=self.embedding_model.model_name)[0],
skill_embedding
)
raw_scores[skill] = similarity
return raw_scores
def _extract_resume_content(self) -> str:
"""
Returns the CV contents previously extracted
"""
if self._resume_content:
return self._resume_content # Use the pre-stored content
else:
raise ValueError("Resume content not available. Ensure `extract_profile_info` is called first.")
def _get_embedding(self, texts: List[str], model: str) -> torch.Tensor:
"""Computes embeddings based on the selected embedding model.
These could be CV embeddings, skill embeddings, or job embeddings """
from openai import OpenAI
client = OpenAI(api_key=openai.api_key)
response = client.embeddings.create(input=texts, model=model)
embeddings = [torch.tensor(item.embedding) for item in response.data]
return torch.stack(embeddings)
def _cosine_similarity(self, vec1: torch.Tensor, vec2: torch.Tensor) -> float:
"""
Compute cosine similarity between a skill and the CV content.
"""
vec1, vec2 = vec1.to(self.device), vec2.to(self.device)
return (torch.dot(vec1, vec2) / (torch.norm(vec1) * torch.norm(vec2))).item()
The function `create_or_load_job_index` creates a new job vector database or loads the index from an existing one (see the `job_index_storage` folder in the code repository).
def create_or_load_job_index(self, json_file: str, index_folder: str = "job_index_storage"):
"""
Create or load a vector database for jobs using LlamaIndex.
"""
if not os.path.exists(index_folder):
print(f"Creating new job vector index with {self.embedding_model.model_name} model...")
with open(json_file, "r") as f:
job_data = json.load(f)
# Convert job descriptions to Document objects by serializing all fields dynamically
documents = []
for job in job_data["jobs"]:
job_text = "n".join([f"{key.capitalize()}: {value}" for key, value in job.items()])
documents.append(Document(text=job_text))
# Create the vector index directly from documents
index = VectorStoreIndex.from_documents(documents, embed_model=self.embedding_model)
# Save index to disk
index.storage_context.persist(persist_dir=index_folder)
return index
else:
print(f"Loading existing job index from {index_folder}...")
storage_context = StorageContext.from_defaults(persist_dir=index_folder)
return load_index_from_storage(storage_context)
The job dataset, from which the vector database is created, is in the `sample_jobs.json` file in the code repository. I curated this example dataset by scraping 50 job ads from different sources into JSON format. Here is how the job ads are stored in this file.
{
"jobs": [
{
"id": "2253637",
"title": "Director of Customer Success",
"company": "HEI Schools",
"description": "HEI Schools is seeking an experienced Director of Customer Success to lead our account management, customer success, and project delivery functions. Responsibilities include overseeing seamless product and service delivery, ensuring high quality and customer satisfaction, and supervising a team of three customer success professionals. The role requires regular international travel and reports directly to the CEO.",
"image": "n/a",
"location": "Helsinki, Finland",
"employmentType": "Full-time, Permanent",
"datePosted": "December 10, 2024",
"salaryRange": "n/a",
"jobProvider": "Jobly",
"url": "https://www.jobly.fi/en/job/director-customer-success-2253637"
},
{
"id": "2258919",
"title": "Service Specialist",
"company": "Stora Enso",
"description": "We are seeking an active and service-oriented Service Specialist for our forest owner services in the Helsinki metropolitan area. Responsibilities include supporting timber sales and service sales, marketing and communication within your area of responsibility, forest consulting, promoting digital solutions in customer management and service offerings, and stakeholder collaboration in the metropolitan area.",
"image": "n/a",
"location": "Helsinki, Finland",
"employmentType": "Permanent, Full-time",
"datePosted": "December 10, 2024",
"salaryRange": "n/a",
"jobProvider": "Jobly",
"url": "https://www.jobly.fi/en/job/palveluasiantuntija-2258919"
}
...
],
"index": 0,
"jobCount": 50,
"hasError": false,
"errors": []
}
The function `query_jobs` retrieves the `top_k` matching job ads from the job vector database, which are then sent to the LLM for the final recommendation.
def query_jobs(self, education, skills, experience, index, top_k=3):
"""
Query the vector database for jobs matching the extracted profile.
"""
print(f"Fetching job suggestions.(LLM: {self.llm.model}, embed_model: {self.embedding_option})")
query = f"Education: {', '.join(education)}; Skills: {', '.join(skills)}; Experience: {', '.join(experience)}"
# Use retriever with appropriate model
retriever = index.as_retriever(similarity_top_k=top_k)
matches = retriever.retrieve(query)
return matches
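For example, the retrieval step could be exercised on its own like this (a sketch; it assumes an already initialized `CvAnalyzer` instance named `analyzer` and made-up profile values):
# Hypothetical standalone usage of the retrieval step.
index = analyzer.create_or_load_job_index("sample_jobs.json")
matches = analyzer.query_jobs(
    education=["MSc in Computer Science"],
    skills=["Python", "Machine Learning", "NLP"],
    experience=["Data Scientist"],
    index=index,
    top_k=3,
)
for match in matches:
    # each match is a NodeWithScore: a similarity score plus the stored job text
    print(round(match.score, 3), match.node.get_content()[:80])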
The `CvAnalyzer` class and its above-mentioned methods are initialized and called by `job_recommender.py`, which serves as the main application code. `job_recommender.py` uses the following custom query engine to generate the final job recommendations.
class RAGStringQueryEngine(BaseModel):
"""
Custom Query Engine for Retrieval-Augmented Generation (fetching matching job recommendations).
"""
retriever: BaseRetriever
llm: OpenAI
qa_prompt: PromptTemplate
# Allow arbitrary types
model_config = ConfigDict(arbitrary_types_allowed=True)
def custom_query(self, candidate_details: str, retrieved_jobs: str):
query_str = self.qa_prompt.format(
query_str=candidate_details, context_str=retrieved_jobs
)
response = self.llm.complete(query_str)
return str(response)
The main function in `job_recommender.py` works as follows:
def main():
#Streamlit messages
st.set_page_config(page_title="CV Analyzer & Job Recommender", page_icon="🔍 ")
st.title("CV Analyzer & Job Recommender")
st.write("Upload a CV to extract key information.")
uploaded_file = st.file_uploader("Select Your CV (PDF)", type="pdf", help="Choose a PDF file up to 5MB")
#Define LLM and embedding model
llm_option = "gpt-4o"
embedding_option = "text-embedding-3-large"
#Following code is triggered after pressing the 'Analyze' button
if uploaded_file is not None:
if st.button("Analyze"):
with st.spinner("Parsing CV... This may take a moment."):
try:
with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as temp_file:
temp_file.write(uploaded_file.getvalue())
temp_file_path = temp_file.name
# Initialize CvAnalyzer with selected models
analyzer = CvAnalyzer(temp_file_path, llm_option, embedding_option)
print("Resume extractor initialized.")
# Extract insights from the resume
insights = analyzer.extract_profile_info()
print("Candidate data extracted.")
# Load or create job vector index
job_index = analyzer.create_or_load_job_index(json_file="sample_jobs.json", index_folder="job_index_storage")
# Extract education, skills, and experience fields from insights object
education = [edu.degree for edu in insights.education] if insights.education else []
skills = insights.skills or []
experience = [exp.role for exp in insights.experience] if insights.experience else []
#Retrieve the top_k matching jobs
matching_jobs = analyzer.query_jobs(education, skills, experience, job_index)
#combine the retrieved matching jobs
retrieved_context = "nn".join([match.node.get_content() for match in matching_jobs])
#combine the profile information
candidate_details = f"Education: {', '.join(education)}; Skills: {', '.join(skills)}; Experience: {', '.join(experience)}"
#Initialize LLM and the query engine
llm = OpenAI(model=llm_option, temperature=0.0)
rag_engine = RAGStringQueryEngine(
retriever=job_index.as_retriever(),
llm=analyzer.llm,
qa_prompt=PromptTemplate(template="""
You are an expert in analyzing resumes. Based on the following candidate details and job descriptions:
Candidate Details:
---------------------
{query_str}
---------------------
Job Descriptions:
---------------------
{context_str}
---------------------
Provide a concise list of the matching jobs. For each matching job, mention job-related details such as
company, brief job description, location, employment type, salary range, URL for each suggestion, and a brief explanation of why the job matches the candidate's profile.
Be critical in matching profile with the jobs. Thoroughly analyze education, skills, and experience to match jobs.
Do not explain why the candidate's profile does not match with the other jobs. Do not include any summary. Order the jobs based on their relevance.
Answer:
"""
),
)
#send the profile details and the retrieved jobs to the LLM for final recommendation
llm_response = rag_engine.custom_query(
candidate_details=candidate_details,
retrieved_jobs=retrieved_context
)
# Display extracted information
st.subheader("Extracted Information")
st.write(f"**Name:** {insights.name}")
st.write(f"**Email:** {insights.email}")
st.write(f"**Age:** {insights.age}")
list_education(insights.education or [])
with st.spinner("Extracting skills..."):
list_skills(insights.skills or [], analyzer)
list_experience(insights.experience or [])
st.subheader("Top Matching Jobs with Explanation")
st.markdown(llm_response)
print("Done.")
except Exception as e:
st.error(f"Failed to analyze the resume: {str(e)}")
The main function initializes the `CvAnalyzer` class with the selected models and calls the `extract_profile_info` function to extract the profile information. It then loads the job vector index and calls the `query_jobs` function to retrieve the jobs matching the extracted profile. Subsequently, it initializes the query engine (`rag_engine`) and sends the retrieved jobs (`retrieved_context`) and the profile information (`candidate_details`) to the LLM with instructions on which aspects to consider when generating the final job recommendations (see `qa_prompt`). For reference, `job_recommender.py` uses the following imports:
import torch
from transformers import AutoTokenizer, AutoModel
from llama_index.core import Settings, VectorStoreIndex
from llama_index.llms.ollama import Ollama
from typing import Union
import streamlit as st
import tempfile
import random
import os
from CV_analyzer import CvAnalyzer
from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.retrievers import BaseRetriever
from llama_index.llms.openai import OpenAI
from llama_index.core.prompts import PromptTemplate
from pydantic import BaseModel, Field, ConfigDict
The following three functions display the educational credentials, skills, and experience. The function `list_skills` calls the `compute_skill_scores` function to compute the cosine similarity score for each skill and then converts each score into a 5-star rating.
def list_skills(skills: list[str], analyzer):
"""
Display skills with their computed scores as large golden stars with full or partial coverage.
"""
if not skills:
st.warning("No skills found to display.")
return
st.subheader("Skills")
# Custom CSS for large golden stars
st.markdown(
"""
<style>
.star-container {
display: inline-block;
position: relative;
font-size: 1.5rem;
color: lightgray;
}
.star-container .filled {
position: absolute;
top: 0;
left: 0;
color: gold;
overflow: hidden;
}
</style>
""",
unsafe_allow_html=True,
)
# Compute scores for all skills
skill_scores = analyzer.compute_skill_scores(skills)
# Display each skill with a star rating
for skill in skills:
score = skill_scores.get(skill, 0) # Get the raw score
max_score = max(skill_scores.values()) if skill_scores else 1 # Avoid division by zero
# Normalize the score to a 5-star scale
normalized_score = (score / max_score) * 5 if max_score > 0 else 0
# Split into full stars and partial star percentage
full_stars = int(normalized_score)
if (normalized_score - full_stars) >= 0.40:
partial_star_percentage = 50
else:
partial_star_percentage = 0
# Generate the star display
stars_html = ""
for i in range(5):
if i < full_stars:
# Fully filled star
stars_html += '<span class="star-container"><span class="filled">★</span>★</span>'
elif i == full_stars:
# Partially filled star
stars_html += f'<span class="star-container"><span class="filled" style="width: {partial_star_percentage}%">★</span>★</span>'
else:
# Empty star
stars_html += '<span class="star-container">★</span>'
# Display skill name and star rating
st.markdown(f"**{skill}**: {stars_html}", unsafe_allow_html=True)
def list_education(education_list):
"""
Display a list of educational qualifications.
"""
if education_list:
st.subheader("Education")
for education in education_list:
#extract metrics for each education (degree) and display it
institution = education.institution if education.institution else "Not found"
degree = education.degree if education.degree else "Not found"
year = education.graduation_date if education.graduation_date else "Not found"
details = education.details if education.details else []
formatted_details = ". ".join(details) if details else "No additional details provided."
st.markdown(f"**{degree}**, {institution} ({year})")
st.markdown(f"_Details_: {formatted_details}")
def list_experience(experience_list):
"""
Display a single-level bulleted list of experiences.
"""
if experience_list:
st.subheader("Experience")
for experience in experience_list:
#extract metrics for each experience and display it
job_title = experience.role if experience.role else "Not found"
company_name = experience.company if experience.company else "Not found"
location = experience.location if experience.location else "Not found"
start_date = experience.start_date if experience.start_date else "Not found"
end_date = experience.end_date if experience.end_date else "Not found"
responsibilities = experience.responsibilities if experience.responsibilities else ["Not found"]
brief_responsibilities = ", ".join(responsibilities)
st.markdown(
f"- Worked as **{job_title}** from {start_date} to {end_date} in *{company_name}*, {location}, "
f"where responsibilities include {brief_responsibilities}."
)
See the following snapshots of the profile information extraction and job recommendations from the Streamlit app. The sample CV (Sample CV.pdf) can be found in the code repository.



The app also allows you to select open-source models if you have a powerful computer with a CUDA-enabled GPU. You can run the open-source models after setting up the required libraries as described in the repository’s instructions; however, on a CPU-only machine the processing can be very slow. That said, the code automatically switches to CPU-based processing if it does not find a CUDA-enabled GPU.
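As a rough idea of how model selection might work in the Multiple models variant, here is a hypothetical sketch (the function name and branching logic are mine; the actual selection code lives in the repository):
import torch
from llama_index.llms.openai import OpenAI
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

def configure_models(llm_option: str, embedding_option: str):
    """Pick OpenAI or open-source backends based on the selected model names."""
    device = "cuda" if torch.cuda.is_available() else "cpu"  # automatic CPU fallback
    if llm_option.startswith("gpt-"):
        llm = OpenAI(model=llm_option, temperature=0.0)
    else:
        llm = Ollama(model=llm_option, request_timeout=300.0)  # served locally by Ollama
    if embedding_option.startswith("text-embedding-"):
        embed_model = OpenAIEmbedding(model=embedding_option)
    else:
        embed_model = HuggingFaceEmbedding(model_name=embedding_option, device=device)
    return llm, embed_model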
The following screenshot shows the processing of the same CV with open-source models: the Llama3.1:latest LLM and the BAAI/bge-small-en-v1.5 embedding model.


The open-source models also do a pretty decent job. It is interesting to note how Llama3.1 extracts and scores skills: it splits the combinations of skills mentioned in the CV into individual skills. However, the resulting skills still remain relevant for providing matching job recommendations.
The following is a snapshot of the first page of my CV (1 of 4 pages), showing how it is structured.

Here are the results of the analysis of my CV. Quite good for my profile.



Directions for Improvements and Extension
There is plenty of room to improve this application. Some of the potential improvements could be in the following directions.
- The current code uses a limited, sample dataset. An efficient job ad ingestion pipeline could be implemented to update the job database with recent jobs.
- I only considered education credentials, skills, and past experience for job recommendations. More information could be extracted for even better recommendations by extending the Pydantic models. This information could include location, more recent experience, publications, and other interests/activities.
- The skill-scoring method could be further improved and integrated with the final recommendations to consider individual skill scores.
- The application could be extended to extract key information from job ads and match them with job seekers’ profiles to find the matching candidates.
- The application can be tested for CVs with diverse formats to check the parsing quality and improve the parsing accordingly.
- Based on the analysis of the CV contents, the user could be given recommendations to improve their CV and skills, along with suggested upskilling packages.
Please find the whole code in the following GitHub repository:
GitHub – umairalipathan1980/CV-Analyzer-Job-Recommender: CV Analyzer & Job Recommender
That’s all folks! If you liked the article, please clap for it (multiple times 👏), leave a comment, and follow me on Medium and LinkedIn.