A recent task required downloading nearly 10 TB of data from an FTP server. After trying several tools, lftp's mirror command turned out to be the least hassle.
mirror options
Mirror specified source directory to local target directory. If target directory ends with a slash, the source base name is appended to target directory name. Source and/or target can be URLs pointing to directories.

 -c, --continue           continue a mirror job if possible
 -e, --delete             delete files not present at remote site
     --delete-first       delete old files before transferring new ones
     --depth-first        descend into subdirectories before transferring files
 -s, --allow-suid         set suid/sgid bits according to remote site
     --allow-chown        try to set owner and group on files
     --ascii              use ascii mode transfers (implies --ignore-size)
     --ignore-time        ignore time when deciding whether to download
     --ignore-size        ignore size when deciding whether to download
     --only-missing       download only missing files
     --only-existing      download only files already existing at target
 -n, --only-newer         download only newer files (-c won't work)
     --no-empty-dirs      don't create empty directories (implies --depth-first)
 -r, --no-recursion       don't go to subdirectories
     --no-symlinks        don't create symbolic links
 -p, --no-perms           don't set file permissions
     --no-umask           don't apply umask to file modes
 -R, --reverse            reverse mirror (put files)
 -L, --dereference        download symbolic links as files
 -N, --newer-than=SPEC    download only files newer than specified time
     --on-change=CMD      execute the command if anything has been changed
     --older-than=SPEC    download only files older than specified time
     --size-range=RANGE   download only files with size in specified range
 -P, --parallel[=N]       download N files in parallel
     --use-pget[-n=N]     use pget to transfer every single file
     --loop               loop until no changes found
 -i RX, --include RX      include matching files
 -x RX, --exclude RX      exclude matching files
 -I GP, --include-glob GP include matching files
 -X GP, --exclude-glob GP exclude matching files
 -v, --verbose[=level]    verbose operation
     --log=FILE           write lftp commands being executed to FILE
     --script=FILE        write lftp commands to FILE, but don't execute them
     --just-print, --dry-run  same as --script=-
     --use-cache          use cached directory listings
     --Remove-source-files  remove files after transfer (use with caution)
 -a                       same as --allow-chown --allow-suid --no-umask
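For reference, a minimal invocation combining the options used later in this post. The host, credentials, and paths are placeholders, and the command is echoed rather than executed so it can be inspected before a real run:

```shell
# Placeholder connection details -- replace with your own server.
FTP_HOST="ftp.example.com"
FTP_USER="user"
FTP_PASS="secret"

# -c runs the quoted lftp commands and exits when they finish.
CMD="open -u $FTP_USER,$FTP_PASS $FTP_HOST; mirror --continue --parallel=8 --only-missing /remote/dir /local/dir"

# Echo instead of running; drop the echo to transfer for real.
echo lftp -c "$CMD"
```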
Issues encountered
1. Although mirror supports parallel transfers, and we were pulling three large directories (each containing many subdirectories), listing the directory tree took a large share of the total time. It is better to mirror the subdirectories directly, which keeps more transfer threads busy.
2. Stick with the --only-missing option. With other options such as --only-newer, for reasons I have not pinned down, lftp would first delete the local copy and then download the file again.
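The "mirror subdirectories directly" tip boils down to launching one job per subdirectory in the background. A minimal sketch of that pattern, with an echo standing in for the actual lftp call (directory names are examples):

```shell
# One entry per remote subdirectory; all map to the same local parent.
declare -A DIR_MAP=(
  ["/fumulu/zimulu1"]="/data/0/bendi/fumulu/"
  ["/fumulu/zimulu2"]="/data/0/bendi/fumulu/"
)

for remote in "${!DIR_MAP[@]}"; do
  # Each subdirectory gets its own background job, so listings and
  # transfers for different subtrees proceed in parallel.
  ( echo "mirroring $remote -> ${DIR_MAP[$remote]}" ) &
done
wait   # block until every background job has finished
echo "all jobs done"
```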
#!/bin/bash

# FTP server connection details
FTP_HOST="xxxxx"
FTP_USER="xxxx"
FTP_PASS="xxxxxxx"

# Remote-to-local directory pairs to synchronize
declare -A DIR_MAP=(
["/fumulu/zimulu1"]="/data/0/bendi/fumulu/"
["/fumulu/zimulu2"]="/data/0/bendi/fumulu/"
["/fumulu/zimulu3"]="/data/0/bendi/fumulu/"
["/fumulu/zimulu4"]="/data/0/bendi/fumulu/"
["/fumulu/zimulu5"]="/data/0/bendi/fumulu/"
["/fumulu/zimulu6"]="/data/0/bendi/fumulu/"
["/fumulu/zimulu7"]="/data/0/bendi/fumulu/"
)
# Create the log directory
LOG_DIR="sync_logs"
mkdir -p "$LOG_DIR"

sync_directory() {
    local remote_dir=$1
    local local_dir=$2

    # Build the log file name (replace directory separators with underscores)
    local log_name=$(echo "${remote_dir}" | tr '/' '_')
    local log_file="$LOG_DIR/${log_name}sync.log"

    # Make sure the local directory exists
    mkdir -p "$local_dir"

    echo "Starting sync of $remote_dir to $local_dir..." | tee -a "$log_file"
    echo "Sync started at: $(date)" >> "$log_file"

    # Run the sync with lftp, downloading only files missing locally
    temp_log=$(mktemp)
    lftp -c "open -u $FTP_USER,$FTP_PASS $FTP_HOST; \
        mirror --parallel=1000 --verbose --only-missing $remote_dir $local_dir" 2>&1 | tee -a "$temp_log" "$log_file"

    # Check for files that failed to download
    if grep -i "File not available" "$temp_log" > /dev/null; then
        echo "Some files failed to download; recording them in shibai.txt..."
        # Extract and record the full path of each failed file
        grep -i "File not available" "$temp_log" | while read -r line; do
            full_path=$(echo "$line" | grep -o "@.*" | cut -d' ' -f1)
            echo "$full_path" >> shibai.txt
        done
    fi

    echo "Sync finished at: $(date)" >> "$log_file"
    echo "----------------------------------------" >> "$log_file"

    # Clean up the temporary log file
    rm -f "$temp_log"
}

# Launch all sync jobs at the same time
for remote_dir in "${!DIR_MAP[@]}"; do
    local_dir=${DIR_MAP[$remote_dir]}
    sync_directory "$remote_dir" "$local_dir" &
done

# Wait for all background jobs to finish
wait
echo "All sync jobs have completed."
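The failure-logging step in sync_directory can be exercised on its own. The sample log line below is synthetic (real lftp error messages may be formatted differently), but it shows what the grep/cut pipeline pulls out and writes to shibai.txt:

```shell
# Synthetic failure line; the "@path" token matches what the script's
# extraction expects, not necessarily real lftp output.
line="get: @/fumulu/zimulu1/file.dat (File not available)"

# Grab everything from "@" onward, then keep the first space-delimited field.
full_path=$(echo "$line" | grep -o "@.*" | cut -d' ' -f1)
echo "$full_path"
```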