Introduction
Recently, we embarked on a migration project for an application built with CakePHP, which supported PHP 7.4. While the initial stages were smooth, we faced a challenge when we examined the database. It was using MySQL version 5.7, which is quite outdated. Therefore, we decided to also migrate the database. However, due to certain limitations imposed by the project team, we only received the database dumps without any additional context regarding the environments or other relevant information.
We set out to install a newer version of MySQL, specifically 8.0.x. Given that the database dump was quite large, over 116GB, we understood that we couldn't simply execute the import without careful consideration. As a result, we opted to plan our approach meticulously. Let's see how we can move forward from here.
Scene 1
We configured the MySQL server and created a database to restore the backup into.
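For context, this setup step was nothing more than creating an empty schema to import into; a minimal sketch (the database name is illustrative) looks like this:
# Create an empty database for the restore (name is illustrative)
sudo mysql -u root -e "CREATE DATABASE database_name;"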
To kick off the restore, we first just ran the import command directly:
sudo mysql -u root database_name < source_dump.sql
This started the restore, but since the dump was so large we reran it with nohup so the import could keep running in the background:
sudo su -
nohup mysql --binary-mode=1 -u root database_name < source_dump.sql > restore_output.log 2>&1 &
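Because the import now runs in the background, progress and any errors can be followed through the log file, for example:
# Watch the restore log for progress and errors
tail -f restore_output.log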
It appeared to be working well initially, but after a few hours it failed with an error caused by a MySQL version mismatch: the source dump had been created on MySQL 5.7, while the restore was running on 8.0. After some searching we fixed the problem and ran the restore again.
Scene 2
As we proceeded with the database restoration, we ran into issues with particular tables. On top of that, our system had only around 300GB of storage, which already held the 116GB backup file; after the restoration attempt we were left with only 50GB of free space.
Later we thought about splitting the database dump into multiple chunk files by searching for the CREATE TABLE statements, but that would also require extra storage.
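For reference, the splitting approach we rejected would have looked roughly like this with csplit, which writes every CREATE TABLE section to its own file and therefore needs nearly as much space again:
# Split the dump into one file per CREATE TABLE section (rejected: needs extra storage)
csplit -z source_dump.sql '/CREATE TABLE/' '{*}'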
Instead, we decided to run the application and see which tables were breaking. We listed the breaking tables and started thinking about how to extract a specific table's CREATE statement and INSERT data from the large SQL file.
In the first attempt we used:
sed -n '3426,$p' source_data_dump.sql > extracted_table.sql
But for this we needed to know both the starting and the ending line number of the table's CREATE statement along with its INSERT statements. This was tricky, and we were left with no good answer.
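One way to at least find the starting line numbers is to grep the dump with line numbers enabled (the users table name is illustrative), though on a dump this size even the scan is slow and the end of the table's data is still not obvious:
# Locate the CREATE TABLE and INSERT lines for one table
grep -n 'CREATE TABLE `users`' source_data_dump.sql
grep -n 'INSERT INTO `users`' source_data_dump.sql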
Later we decided to extract the table data programmatically, so we wrote a simple Python program that takes three inputs: sourceDB.sql, output.sql, and the table_name.
Check out the program here:
import sys

if len(sys.argv) != 4:
    print("Usage: python extract_table.py <input.sql> <output.sql> <table_name>")
    sys.exit(1)

input_file = sys.argv[1]
output_file = sys.argv[2]
table_name = sys.argv[3]

inside_create = False

with open(input_file, 'r', encoding='utf-8', errors='ignore') as infile, open(output_file, 'w', encoding='utf-8') as outfile:
    for line in infile:
        # Optionally, capture the DROP statement if it exists
        if f"DROP TABLE IF EXISTS `{table_name}`" in line:
            outfile.write(line)
        # Start capturing the CREATE TABLE block
        if not inside_create and f"CREATE TABLE `{table_name}`" in line:
            inside_create = True
        if inside_create:
            outfile.write(line)
            # Assuming that the CREATE TABLE statement ends at a semicolon on its own line or at the end of a line.
            if line.strip().endswith(';'):
                inside_create = False
        # Capture any INSERT statements for the table.
        # (Assuming INSERT statements start on a new line.)
        if line.strip().startswith(f"INSERT INTO `{table_name}`"):
            outfile.write(line)
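With that in place, pulling out a single table and loading it into the restored database looked roughly like this (the users table name is illustrative):
# Extract one table's DROP/CREATE/INSERT statements from the full dump
python3 extract_table.py sourceDB.sql users_only.sql users
# Load just that table into the target database
sudo mysql -u root database_name < users_only.sql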
Scene 3
The previous program worked fine for extracting a table's CREATE statement and INSERT data, but it was far too slow, taking around 40-50 minutes per table, which was too high.
So the next thought was to process the dump in parallel (the script below uses multiple worker processes), and it worked like a charm: a table could be extracted within 5-10 minutes from a 250GB dump file.
Here is the parallel program, which follows the same pattern but runs much faster.
#!/usr/bin/env python3
"""
Extract SQL statements for a specific table from a huge SQL dump file.

Usage:
    python extract_table.py <input.sql> <output.sql> <table_name>

This script will search for statements (DROP, CREATE, INSERT) that mention the
specified table name and write them (in order) to the output file.
"""

import os
import sys
import mmap
from concurrent.futures import ProcessPoolExecutor


def find_statement_end(mm, start):
    """
    Scan mm (a memory-mapped region) starting at 'start' and return the index of the
    semicolon that terminates the SQL statement. This function ignores semicolons that
    are inside single or double quotes.

    Parameters:
        mm: The memory-mapped file (bytes-like object).
        start: The index to start scanning.

    Returns:
        The index of the semicolon that is outside any quoted string, or -1 if not found.
    """
    in_single_quote = False
    in_double_quote = False
    escape = False
    pos = start
    while pos < len(mm):
        c = chr(mm[pos])
        if escape:
            escape = False
        else:
            if c == '\\':
                escape = True
            elif c == "'" and not in_double_quote:
                in_single_quote = not in_single_quote
            elif c == '"' and not in_single_quote:
                in_double_quote = not in_double_quote
            elif c == ';' and not in_single_quote and not in_double_quote:
                return pos
        pos += 1
    return -1


def process_segment(input_file, table_name, seg_start, seg_end, overlap, max_chunk=524288000):
    """
    Process a file segment by mapping subchunks (max_chunk bytes + overlap) with exact
    offset alignment. Searches for SQL statements for the specified table.

    Parameters:
        input_file: Path to the SQL dump.
        table_name: Table name to search for.
        seg_start: Starting byte of the segment.
        seg_end: Ending byte of the segment.
        overlap: Extra bytes to capture statements that span subchunks.
        max_chunk: Maximum number of bytes to map per subchunk (default: 500 MB).

    Returns:
        A list of tuples: (absolute_offset, statement_string)
    """
    results = []
    with open(input_file, 'rb') as f:
        file_size = os.fstat(f.fileno()).st_size
        page_size = os.sysconf("SC_PAGE_SIZE")
        current_offset = seg_start
        while current_offset < seg_end:
            # Align current_offset to a page boundary.
            aligned_offset = (current_offset // page_size) * page_size
            delta = current_offset - aligned_offset
            desired_length = min(max_chunk + overlap, seg_end - current_offset + overlap)
            mapping_length = delta + desired_length
            if aligned_offset + mapping_length > file_size:
                mapping_length = file_size - aligned_offset
                desired_length = mapping_length - delta
            try:
                mm = mmap.mmap(f.fileno(), length=mapping_length, offset=aligned_offset, access=mmap.ACCESS_READ)
            except Exception as e:
                print(f"Error mapping offset {aligned_offset} with length {mapping_length}: {e}", file=sys.stderr)
                break
            pos = delta  # Start at our desired offset inside the mapping.
            mm_size = mm.size()
            # Prepare search patterns as bytes.
            drop_pattern = f"DROP TABLE IF EXISTS `{table_name}`".encode('utf-8')
            create_pattern = f"CREATE TABLE `{table_name}`".encode('utf-8')
            insert_pattern = f"INSERT INTO `{table_name}`".encode('utf-8')
            patterns = [drop_pattern, create_pattern, insert_pattern]
            while pos < mm_size:
                candidate = None
                for pat in patterns:
                    idx = mm.find(pat, pos)
                    if idx != -1 and (candidate is None or idx < candidate):
                        candidate = idx
                if candidate is None:
                    break  # No more matches in this subchunk
                # Use the custom function to locate the correct terminating semicolon.
                semicolon_idx = find_statement_end(mm, candidate)
                if semicolon_idx == -1:
                    break  # No terminating semicolon found
                statement_bytes = mm[candidate:semicolon_idx + 1]
                try:
                    statement = statement_bytes.decode('utf-8', errors='ignore')
                except Exception as e:
                    statement = ""
                    print(f"Decoding error at offset {aligned_offset + candidate}: {e}", file=sys.stderr)
                results.append((aligned_offset + candidate, statement))
                pos = semicolon_idx + 1  # Continue search after the found statement
            mm.close()
            current_offset += max_chunk
    return results


def merge_results(results_lists):
    """
    Merge, sort, and deduplicate lists of (offset, statement) tuples.
    """
    merged = []
    for sublist in results_lists:
        merged.extend(sublist)
    merged.sort(key=lambda x: x[0])
    unique = []
    seen_offsets = set()
    for offset, stmt in merged:
        if offset not in seen_offsets:
            unique.append((offset, stmt))
            seen_offsets.add(offset)
    return unique


def main(input_file, output_file, table_name, num_workers=6, overlap=1024*1024):
    """
    Splits the file into segments, processes them in parallel, merges results,
    and writes the extracted SQL statements to the output file.
    """
    file_size = os.path.getsize(input_file)
    chunk_size = file_size // num_workers
    segments = []
    for i in range(num_workers):
        seg_start = i * chunk_size
        seg_end = file_size if i == num_workers - 1 else (i + 1) * chunk_size
        segments.append((seg_start, seg_end))
    results_all = []
    with ProcessPoolExecutor(max_workers=num_workers) as executor:
        futures = []
        for seg_start, seg_end in segments:
            futures.append(executor.submit(process_segment, input_file, table_name, seg_start, seg_end, overlap))
        for future in futures:
            results_all.append(future.result())
    merged_results = merge_results(results_all)
    with open(output_file, 'w', encoding='utf-8') as outf:
        for offset, stmt in merged_results:
            outf.write(stmt)
    print(f"Extraction complete. {len(merged_results)} statements written to {output_file}")


if __name__ == '__main__':
    if len(sys.argv) != 4:
        print("Usage: python extract_table.py <input.sql> <output.sql> <table_name>")
        sys.exit(1)
    input_file = sys.argv[1]
    output_file = sys.argv[2]
    table_name = sys.argv[3]
    main(input_file, output_file, table_name)
This version includes the following features:
Custom Semicolon Search:
The find_statement_end function scans the memory-mapped region from a given start position. It toggles flags when entering or exiting single or double quoted strings and ignores semicolons that occur inside them. Only when it finds a semicolon outside any quotes does it return that index.
Integration into Processing Loop:
In the inner loop of process_segment, after finding the candidate position of one of our patterns, we now call find_statement_end(mm, candidate) instead of a simple mm.find(b";", candidate). This ensures that semicolons inside string literals (like those in an INSERT statement) are not mistaken for statement terminators.
Overall Processing:
The file is still processed in aligned subchunks with an overlap to ensure multi-line statements are captured, and the segments are processed in parallel. The results are merged, sorted, and deduplicated before writing to the output file.
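Usage is identical to the single-threaded version; assuming the parallel script is saved as extract_table_parallel.py (the file name is ours), a run looks like this:
# Same three arguments as before; time the run to compare with the single-threaded version
time python3 extract_table_parallel.py sourceDB.sql users_only.sql users
The number of workers and the overlap can be tuned through the num_workers and overlap parameters of main().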
Hope you find this helpful.
Looking for consulting? Drop us a mail at [email protected]