Extracting a table's CREATE and INSERT statements with data from a .SQL file

Introduction

Recently, we embarked on a migration project for an application built with CakePHP, which supported PHP 7.4. While the initial stages were smooth, we faced a challenge when we examined the database. It was using MySQL version 5.7, which is quite outdated. Therefore, we decided to also migrate the database. However, due to certain limitations imposed by the project team, we only received the database dumps without any additional context regarding the environments or other relevant information.

We set out to install a newer version of MySQL, specifically 8.0.x. Given that the database dump was quite large, over 116GB, we understood that we couldn't simply execute the import without careful consideration. As a result, we opted to plan our approach meticulously. Let's see how we can move forward from here.

Scene 1

We configured the MySQL server and created a database to restore the backup into.

As a first step, we simply ran the restore command:

sudo mysql -u root database_name < source_dump.sql

This command started the restore, but since the database was so large, we re-ran it in the background with nohup:

sudo su -

nohup mysql --binary-mode=1 -u root database_name < source_dump.sql > restore_output.log 2>&1 &

It appeared to be functioning well initially, but after a few hours it failed with an error caused by a MySQL version mismatch: the source dump was created with MySQL 5.7, while the restore was being attempted on 8.0.

We fixed the problem after some searching and ran the restore again.

Scene 2

As we proceeded with the database restoration, we encountered issues with particular tables. Moreover, our system had a storage limit of roughly 300GB, which already held the 116GB+ backup file; after the restoration, we were left with only about 50GB of free space.

Later, we thought about splitting the database dump into multiple chunk files by searching for CREATE TABLE statements (a rough sketch of this idea is shown below), but that approach also requires extra storage.
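A minimal sketch of that splitting idea, assuming the usual mysqldump convention of one CREATE TABLE statement per table and using the table name for each chunk's filename (both assumptions, not taken from our actual dump):

import re
import sys

def split_dump(dump_path):
    # Everything before the first CREATE TABLE goes into a header chunk;
    # each CREATE TABLE statement starts a new per-table chunk file.
    pattern = re.compile(r"CREATE TABLE `([^`]+)`")
    out = open("chunk_header.sql", "w", encoding="utf-8")
    with open(dump_path, "r", encoding="utf-8", errors="ignore") as dump:
        for line in dump:
            match = pattern.search(line)
            if match:
                out.close()
                out = open(f"chunk_{match.group(1)}.sql", "w", encoding="utf-8")
            out.write(line)
    out.close()

if __name__ == "__main__":
    split_dump(sys.argv[1])

Since every chunk is written back to disk, the dump is effectively duplicated, which is exactly the storage problem we were trying to avoid.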


We decided to run the application and see which tables were breaking. We listed those tables and started thinking about extracting each specific table's CREATE and INSERT data from the large SQL file.

On the first try, we used sed -n '3426,$p' source_data_dump.sql > extracted_table.sql

But for this we needed to know the starting and ending line numbers of the table along with its INSERTs. This was tricky, and we were left with no clean answer.
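In hindsight, the starting line number could have been found with a quick scan like the one below (my_table is a placeholder, not one of our actual tables); the real blocker was knowing where that table's last INSERT ends, which this does not solve:

# Print the 1-based line number of a table's CREATE TABLE statement.
with open("source_data_dump.sql", "r", encoding="utf-8", errors="ignore") as dump:
    for line_no, line in enumerate(dump, start=1):
        if "CREATE TABLE `my_table`" in line:
            print(line_no)
            break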

Later, we decided to extract the table data with a program, so we wrote a simple Python program that takes three inputs: sourceDB.sql, output.sql, and the table_name.

Check out the program here:

import sys
if len(sys.argv) != 4:
    print("Usage: python extract_table.py <input.sql> <output.sql> <table_name>")
    sys.exit(1)
input_file = sys.argv[1]
output_file = sys.argv[2]
table_name = sys.argv[3]
inside_create = False
with open(input_file, 'r', encoding='utf-8', errors='ignore') as infile, open(output_file, 'w', encoding='utf-8') as outfile:
    for line in infile:
        # Optionally, capture the DROP statement if it exists
        if f"DROP TABLE IF EXISTS `{table_name}`" in line:
            outfile.write(line)
        # Start capturing the CREATE TABLE block
        if not inside_create and f"CREATE TABLE `{table_name}`" in line:
            inside_create = True
        if inside_create:
            outfile.write(line)
            # Assuming that the CREATE TABLE statement ends at a semicolon on its own line or at the end of a line.
            if line.strip().endswith(';'):
                inside_create = False
        # Capture any INSERT statements for the table.
        # (Assuming INSERT statements start on a new line.)
        if line.strip().startswith(f"INSERT INTO `{table_name}`"):
            outfile.write(line)


Scene 3

The previous program worked fine for extracting a table's CREATE and INSERT data, but it was too slow, taking around 40-50 minutes per table, which was far too long.

The next thought was to process the file in parallel with multiple worker processes, and it worked like a charm: a table could be extracted within 5-10 minutes from 250GB+ of dump files.

Here is the parallel (multiprocessing) program, which follows the same pattern but runs much faster.

 

#!/usr/bin/env python3
"""
Extract SQL statements for a specific table from a huge SQL dump file.

Usage:
  python extract_table.py <input.sql> <output.sql> <table_name>

This script will search for statements (DROP, CREATE, INSERT) that mention the
specified table name and write them (in order) to the output file.
"""

import os
import sys
import mmap
from concurrent.futures import ProcessPoolExecutor

def find_statement_end(mm, start):
    """
    Scan mm (a memory-mapped region) starting at 'start' and return the index of the
    semicolon that terminates the SQL statement. This function ignores semicolons that
    are inside single or double quotes.

    Parameters:
      mm: The memory-mapped file (bytes-like object).
      start: The index to start scanning.

    Returns:
      The index of the semicolon that is outside any quoted string, or -1 if not found.
    """
    in_single_quote = False
    in_double_quote = False
    escape = False
    pos = start
    while pos < len(mm):
        c = chr(mm[pos])
        if escape:
            escape = False
        else:
            if c == '\\':
                escape = True
            elif c == "'" and not in_double_quote:
                in_single_quote = not in_single_quote
            elif c == '"' and not in_single_quote:
                in_double_quote = not in_double_quote
            elif c == ';' and not in_single_quote and not in_double_quote:
                return pos
        pos += 1
    return -1

def process_segment(input_file, table_name, seg_start, seg_end, overlap, max_chunk=524288000):
    """
    Process a file segment by mapping subchunks (max_chunk bytes + overlap) with exact
    offset alignment. Searches for SQL statements for the specified table.

    Parameters:
      input_file:   Path to the SQL dump.
      table_name:   Table name to search for.
      seg_start:    Starting byte of the segment.
      seg_end:      Ending byte of the segment.
      overlap:      Extra bytes to capture statements that span subchunks.
      max_chunk:    Maximum number of bytes to map per subchunk (default: 500 MB).

    Returns:
      A list of tuples: (absolute_offset, statement_string)
    """
    results = []
    with open(input_file, 'rb') as f:
        file_size = os.fstat(f.fileno()).st_size
        page_size = os.sysconf("SC_PAGE_SIZE")
        current_offset = seg_start
        while current_offset < seg_end:
            # Align current_offset to a page boundary.
            aligned_offset = (current_offset // page_size) * page_size
            delta = current_offset - aligned_offset

            desired_length = min(max_chunk + overlap, seg_end - current_offset + overlap)
            mapping_length = delta + desired_length

            if aligned_offset + mapping_length > file_size:
                mapping_length = file_size - aligned_offset
                desired_length = mapping_length - delta

            try:
                mm = mmap.mmap(f.fileno(), length=mapping_length, offset=aligned_offset, access=mmap.ACCESS_READ)
            except Exception as e:
                print(f"Error mapping offset {aligned_offset} with length {mapping_length}: {e}", file=sys.stderr)
                break

            pos = delta  # Start at our desired offset inside the mapping.
            mm_size = len(mm)  # length of this mapping (mm.size() would return the whole file size)

            # Prepare search patterns as bytes.
            drop_pattern = f"DROP TABLE IF EXISTS `{table_name}`".encode('utf-8')
            create_pattern = f"CREATE TABLE `{table_name}`".encode('utf-8')
            insert_pattern = f"INSERT INTO `{table_name}`".encode('utf-8')
            patterns = [drop_pattern, create_pattern, insert_pattern]

            while pos < mm_size:
                candidate = None
                for pat in patterns:
                    idx = mm.find(pat, pos)
                    if idx != -1 and (candidate is None or idx < candidate):
                        candidate = idx
                if candidate is None:
                    break  # No more matches in this subchunk

                # Use the custom function to locate the correct terminating semicolon.
                semicolon_idx = find_statement_end(mm, candidate)
                if semicolon_idx == -1:
                    break  # No terminating semicolon found

                statement_bytes = mm[candidate:semicolon_idx + 1]
                try:
                    statement = statement_bytes.decode('utf-8', errors='ignore')
                except Exception as e:
                    statement = ""
                    print(f"Decoding error at offset {aligned_offset + candidate}: {e}", file=sys.stderr)
                results.append((aligned_offset + candidate, statement))
                pos = semicolon_idx + 1  # Continue search after the found statement

            mm.close()
            current_offset += max_chunk
    return results

def merge_results(results_lists):
    """
    Merge, sort, and deduplicate lists of (offset, statement) tuples.
    """
    merged = []
    for sublist in results_lists:
        merged.extend(sublist)
    merged.sort(key=lambda x: x[0])
    unique = []
    seen_offsets = set()
    for offset, stmt in merged:
        if offset not in seen_offsets:
            unique.append((offset, stmt))
            seen_offsets.add(offset)
    return unique

def main(input_file, output_file, table_name, num_workers=6, overlap=1024*1024):
    """
    Splits the file into segments, processes them in parallel, merges results,
    and writes the extracted SQL statements to the output file.
    """
    file_size = os.path.getsize(input_file)
    chunk_size = file_size // num_workers
    segments = []
    for i in range(num_workers):
        seg_start = i * chunk_size
        seg_end = file_size if i == num_workers - 1 else (i + 1) * chunk_size
        segments.append((seg_start, seg_end))

    results_all = []
    with ProcessPoolExecutor(max_workers=num_workers) as executor:
        futures = []
        for seg_start, seg_end in segments:
            futures.append(executor.submit(process_segment, input_file, table_name, seg_start, seg_end, overlap))
        for future in futures:
            results_all.append(future.result())

    merged_results = merge_results(results_all)

    with open(output_file, 'w', encoding='utf-8') as outf:
        for offset, stmt in merged_results:
            outf.write(stmt + "\n")  # newline after each statement keeps the output readable
    
    print(f"Extraction complete. {len(merged_results)} statements written to {output_file}")

if __name__ == '__main__':
    if len(sys.argv) != 4:
        print("Usage: python extract_table.py <input.sql> <output.sql> <table_name>")
        sys.exit(1)
    input_file = sys.argv[1]
    output_file = sys.argv[2]
    table_name = sys.argv[3]
    main(input_file, output_file, table_name)


This includes the following features:

Custom Semicolon Search:
The find_statement_end function scans the memory-mapped region from a given start position. It toggles flags when entering or exiting single or double quoted strings and ignores semicolons that occur inside them. Only when it finds a semicolon outside any quotes does it return that index.
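As a quick sanity check (not part of the original script), the function can be exercised on a plain bytes object, since it only relies on len() and integer indexing. The statement below is a made-up example; the semicolon inside the quoted string at index 26 is skipped, and the terminating semicolon at index 30 is returned:

# find_statement_end works on any bytes-like object, not just an mmap.
stmt = b"INSERT INTO `t` VALUES ('a;b');"
print(find_statement_end(stmt, 0))  # prints 30, the index of the final semicolon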

Integration into Processing Loop:
In the inner loop of process_segment, after finding the candidate position of one of our patterns, we now call find_statement_end(mm, candidate) instead of a simple mm.find(b";", candidate). This ensures that semicolons inside string literals (like those in an INSERT statement) are not mistaken for statement terminators.

Overall Processing:
The file is still processed in aligned subchunks with an overlap to ensure multi-line statements are captured, and the segments are processed in parallel. The results are merged, sorted, and deduplicated before writing to the output file.
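If the script above is saved as extract_table.py (filename assumed), it can also be imported and driven from another small Python script, which makes it easy to experiment with the number of worker processes for the machine at hand; the file names and table name below are placeholders:

# Hypothetical driver script; adjust num_workers to the available CPU cores.
from extract_table import main

if __name__ == "__main__":
    main("source_dump.sql", "users_extracted.sql", "users", num_workers=8)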


Hope you find this helpful.

Looking for our consulting? Drop us a mail at [email protected]

Ram Krishna March 9, 2025