
Harnessing Advanced Machine Learning for Text Classification at Razi Title: A Technical Exploration

At Razi Title, our commitment to innovation drives us to explore and implement cutting-edge technologies. In this blog, I'll take you through our journey in developing a sophisticated text classification model using TensorFlow, delving into the advanced machine learning techniques we've employed.

import os

# Hide all GPUs so TensorFlow falls back to the CPU. Setting this before any
# TensorFlow-related import ensures CUDA never initializes.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

from layers.attention_layers import ScaledDotProductAttention
Our journey begins with preparing our machine learning environment. Choosing TensorFlow, a robust and versatile framework, allows us to build complex models with relative ease. The decision to run computations on the CPU is pragmatic: on the PC we used, a modern i9 processor outperformed the NVIDIA video card with 12GB of VRAM for this workload. As an added bonus, the model remains deployable in diverse environments without depending on specific hardware capabilities.
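One note on that first import: layers.attention_layers is an in-house module that isn't shown in this post. A minimal sketch of a scaled dot-product attention layer, consistent with how we call it later (it returns both the attended output and the attention weights), could look like this:

import tensorflow as tf


class ScaledDotProductAttention(tf.keras.layers.Layer):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""

    def call(self, query, key, value):
        # Scale by sqrt(d_k) so dot products don't grow with the key dimension
        d_k = tf.cast(tf.shape(key)[-1], tf.float32)
        scores = tf.matmul(query, key, transpose_b=True) / tf.sqrt(d_k)
        weights = tf.nn.softmax(scores, axis=-1)
        # Return the weighted values along with the weights themselves
        return tf.matmul(weights, value), weights

The in-house implementation may differ in detail; this sketch simply matches the call signature used in the model below.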

  
import os

import numpy as np
import tensorflow as tf
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import (
    Dense,
    Dropout,
    Embedding,
    Flatten,
    Input,
    TextVectorization,
)
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical


def load_classified_data(input_directory):
    """Load labeled text data: each subdirectory name becomes a class label,
    and every file inside it becomes one training example."""
    data = []
    labels = []

    for label_dir in os.listdir(input_directory):
        full_dir_path = os.path.join(input_directory, label_dir)
        if os.path.isdir(full_dir_path):
            for file_name in os.listdir(full_dir_path):
                file_path = os.path.join(full_dir_path, file_name)
                with open(file_path, "r") as file:
                    text = file.read()
                    data.append(text)  # raw page text
                    labels.append(label_dir)  # directory name is the label

    return data, labels
Our load_classified_data function is more than just a data loader; it's a gateway to structured machine learning. By categorizing text from various directories, we lay a solid foundation for supervised learning. This method ensures that our model can differentiate between distinct categories, which is vital in automating document classification in real estate transactions. Essentially, put text files into folders. Those folders become labels.
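For example, a hypothetical input directory for title documents might be organized like this (the folder and file names here are invented for illustration):

training_data/
    deed/
        0001.txt
        0002.txt
    mortgage/
        0001.txt
    survey/
        0001.txt

Calling load_classified_data("training_data") on that layout would return the raw file contents in data and ["deed", "deed", "mortgage", "survey"] in labels.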

if __name__ == "__main__":
    input_directory = "/home/rzwink/working/cinnabar/ld/20231228094239"
    data, labels = load_classified_data(input_directory)

    # Vectorization and training hyperparameters
    NUM_WORDS = 1000
    EMBEDDING_DIM = 16
    MAX_LENGTH = 500
    NUM_EPOCHS = 20
    BATCH_SIZE = 32  # Adjust as needed to fit available memory

    vectorizer = TextVectorization(
        max_tokens=NUM_WORDS, output_sequence_length=MAX_LENGTH, output_mode="int"
    )
    vectorizer.adapt(data)

    # Capture the vectorizer vocabulary (word -> index) for later inspection
    vocab = vectorizer.get_vocabulary()
    vocab_dict = {word: index for index, word in enumerate(vocab)}

    # Encoding the labels
    label_encoder = LabelEncoder()
    encoded_labels = label_encoder.fit_transform(labels)

    # Save the label classes so predictions can be decoded after loading
    np.save("models/page_classifier/label_encoder.npy", label_encoder.classes_)

    X_train, X_test, y_train, y_test = train_test_split(
        np.array([[s] for s in data]),
        to_categorical(encoded_labels),
        test_size=0.2,
        random_state=42,
    )

    # Define the model with ScaledDotProductAttention
    text_input = Input(shape=(1,), dtype=tf.string)  # Adjusted for a batch of strings
    x = vectorizer(text_input)
    x = Embedding(NUM_WORDS, EMBEDDING_DIM, input_length=MAX_LENGTH)(x)

    # Create query, key, value for attention
    query = Dense(EMBEDDING_DIM)(x)
    key = Dense(EMBEDDING_DIM)(x)
    value = Dense(EMBEDDING_DIM)(x)

In this section, we dive into the nuances of text vectorization and label encoding. Text vectorization converts raw text into numerical tokens, a format that neural networks can understand and process. This step is crucial in extracting meaningful patterns from text data, enabling the model to learn from linguistic structures. Our use of a limited vocabulary size (NUM_WORDS) and sequence length (MAX_LENGTH) balances model complexity with computational efficiency.
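As a quick illustration (separate from the training script, with hypothetical tokens and labels), here is the kind of output these two transformations produce:

# Illustration only; real token ids depend on the adapted corpus.
sample = tf.constant(["general warranty deed recorded this day"])
print(vectorizer(sample).shape)  # (1, 500): token ids padded to MAX_LENGTH
# Assuming a "deed" folder existed in the training data:
print(label_encoder.transform(["deed"]))  # e.g. array([1]): integer class id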
  
    # Apply Scaled Dot Product Attention
    attention_output, _ = ScaledDotProductAttention()(query, key, value)

    x = Flatten()(attention_output)

    x = Dense(24, activation="relu")(x)
    x = Dropout(0.5)(x)
    outputs = Dense(len(label_encoder.classes_), activation="softmax")(x)

    model = Model(text_input, outputs)

    # Compile and train the model
    model.compile(
        loss="categorical_crossentropy",
        optimizer=Adam(learning_rate=0.001),
        metrics=["accuracy"],
    )
    # Define the early stopping callback
    early_stopping = EarlyStopping(
        monitor="val_loss",  # Monitor the validation loss
        patience=3,  # Number of epochs with no improvement after which training will be stopped
        restore_best_weights=True,  # Restore model weights from the epoch with the best value of the monitored quantity
    )

    # Train the model with early stopping
    model.fit(
        X_train,
        y_train,
        epochs=NUM_EPOCHS,
        batch_size=BATCH_SIZE,
        validation_data=(X_test, y_test),
        callbacks=[early_stopping],  # Add the callback here
    )
    # Save the model
    model.save("models/page_classifier", save_format="tf")
Training a deep learning model is a delicate balance between learning enough and not overfitting. Our use of the EarlyStopping callback is a testament to our focus on creating a robust model. This approach ensures that our model retains its ability to generalize to new, unseen data, which is crucial in the dynamic field of real estate where new document types and terminologies emerge regularly.

    # Evaluate the model
    y_pred = model.predict(X_test)
    y_pred_labels = label_encoder.inverse_transform([np.argmax(y) for y in y_pred])
    y_test_labels = label_encoder.inverse_transform([np.argmax(y) for y in y_test])

    # Print classification report
    print(classification_report(y_test_labels, y_pred_labels, zero_division=1))  
The evaluation phase is where theory meets practice. By closely examining the classification report, we gain insights into the model's performance across various document types. This helps us fine-tune the model for even better accuracy in real-world applications, such as automating title searches and document verification processes at Razi Title.

Through this deep dive into our machine learning model, we at Razi Title are not just embracing technological advancement; we're actively shaping it to fit our unique industry needs. This model represents a significant step in our ongoing journey to revolutionize the real estate sector with AI-driven solutions.
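To close the loop, here is a hypothetical sketch of how the saved artifacts might be used at inference time (the sample page text is invented; the paths match the training script above):

import numpy as np
import tensorflow as tf

# Load the SavedModel (the vectorizer is baked into the graph) and the label
# classes. Depending on your TensorFlow version, the custom attention layer
# may need to be supplied via the custom_objects argument.
model = tf.keras.models.load_model("models/page_classifier")
classes = np.load("models/page_classifier/label_encoder.npy", allow_pickle=True)

# Classify one page of raw text; the model expects a batch of string rows
page = np.array([["THIS GENERAL WARRANTY DEED, made this day between ..."]])
probs = model.predict(page)
print(classes[np.argmax(probs)])  # predicted page-level document label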
