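Data safety is a major concern for anyone who does data analysis. Both the key data we analyze and the key scripts we use need to be backed up regularly.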
scp
The simplest way to back up is to use cp (local disk) or scp (remote disk) to make a copy of your result files, and to copy them again whenever they change. The commands are:
- cp -fur source_project project_bak
- scp -r source_project user@remote_server_ip:project_bak
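One detail of the cp command is easy to miss: when project_bak already exists, `cp -r source_project project_bak` places the copy inside it as project_bak/source_project, so the first and later runs end up in different places. The sketch below is one possible way around this, assuming GNU cp; the helper name `backup_local` and the temporary directories are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper (not from the original article): copy the *contents*
# of src into dst so that repeated runs always land in the same place.
backup_local() {
    local src=$1 dst=$2
    mkdir -p "$dst"
    # -f: overwrite if needed, -u: copy only files newer than the existing
    # copy, -r: recurse. "src/." means "the contents of src".
    cp -fur "$src/." "$dst/"
}

# Demo on throwaway directories
work=$(mktemp -d)
mkdir -p "$work/source_project"
echo "v1" > "$work/source_project/result.txt"

backup_local "$work/source_project" "$work/project_bak"
backup_local "$work/source_project" "$work/project_bak"   # second run: no nested copy
ls "$work/project_bak"
```

Running the helper twice leaves a single copy of each file directly under project_bak, which keeps a scheduled job predictable.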
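To back up on a schedule, we can put the cp or scp command into crontab and have it run every night at 23:00. For backups to a remote server, we can configure password-free (key-based) SSH login so that the backup can run unattended.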
- # Crontab format
- # Minute Hour Day Month Week command
- # * means every minute / hour / day / month / weekday
- # Run the cp command at 23:00 every day
- 0 23 * * * cp -fur source_project project_bak
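- # */2 means run the command every 2 minutes / hours / days / months / weekdays, depending on the field
- # Run the cp command once every 24 hours, at 00:00 (the hour field only spans 0-23, so a step of */24 may be rejected by some cron implementations; a fixed hour is safer)
- 0 0 * * * cp -fur source_project project_bak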
- 0 0 */1 * * scp -r source_project user@remote_server_ip:project_bak
- # crontab also supports a special time string
- # @reboot: run the given command at boot
- @reboot cmd
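The crontab format above can also be generated from a script rather than typed by hand. This is a minimal sketch; the function name `daily_entry` is hypothetical, and the installation and key-setup commands are left as comments so that running the block changes nothing on the machine.

```shell
#!/bin/bash
# Hypothetical helper: print a crontab line that runs a command every day
# at the given hour and minute.
daily_entry() {
    local hour=$1 minute=$2
    shift 2
    # Field order in crontab is: minute hour day month weekday command
    printf '%s %s * * * %s\n' "$minute" "$hour" "$*"
}

entry=$(daily_entry 23 0 cp -fur source_project project_bak)
echo "$entry"

# To append the entry to the current user's crontab (not executed here):
#   (crontab -l 2>/dev/null; echo "$entry") | crontab -
#
# For unattended scp, key-based login is usually set up once with:
#   ssh-keygen -t ed25519
#   ssh-copy-id user@remote_server_ip
```

Cron jobs run with a minimal environment and from the home directory, so absolute paths for source_project and project_bak are usually the safer choice in a real entry.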
rsync
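cp and scp are simple to use, but each run copies whole files again (scp copies all of them every time), which costs time and wears on the disk, especially when there is a lot to copy.
rsync is an incremental backup tool: it synchronizes only the changed parts of the files that have been modified, which greatly reduces the number of files transferred and the transfer time. Usage is as follows: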
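- # Back up the contents of the local project directory to /backup/project on the remote server
- # Note the trailing slash after the first project: it means copy the contents of the directory without creating a project folder inside the target directory. Compare with the second command; both achieve the same result.
- # -a: archive mode, equals -rlptgoD
- # -r: recurse into directories
- # -p: preserve the permissions of the original files
- # -u: skip files that are newer on the remote side, to avoid overwriting changes made there
- # -L: copy the files that symbolic links point to instead of the links themselves, so that links do not break on the remote server because of mismatched paths
- # -t: preserve modification times
- # -v: show what is being updated
- # -z: compress data during transfer; useful when the connection is slow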
- rsync -aruLptvz --delete project/ user@remoteServer:/backup/project
- rsync -aruLptvz --delete project user@remoteServer:/backup/
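The effect of the trailing slash is easiest to see on a local example. The sketch below uses throwaway directories (all names are made up) and only the -a option; it is skipped when rsync is not installed.

```shell
#!/bin/bash
# Local illustration of the trailing-slash rule; no remote server involved.
work=$(mktemp -d)
mkdir -p "$work/project" "$work/with_slash" "$work/without_slash"
echo "data" > "$work/project/a.txt"

if command -v rsync >/dev/null 2>&1; then
    # Trailing slash on the source: copy the contents of project/
    rsync -a "$work/project/" "$work/with_slash/"
    # No trailing slash: create a project/ directory inside the target
    rsync -a "$work/project" "$work/without_slash/"
    find "$work/with_slash" "$work/without_slash" -type f
else
    echo "rsync not installed; skipping demo"
fi
```

Because --delete removes files on the receiving side that no longer exist in the source, adding -n (--dry-run) to a first run is a cheap way to preview what would be transferred or deleted.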
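rsync works as a mirror: it keeps the remote server identical to the local files. If the local files are fine, the remote copy is fine too. But if files are deleted by mistake or damaged by a faulty program run, and this is not noticed before the next sync, the remote backup loses its value because it is damaged as well. An accidental deletion is fairly easy to notice and can be corrected in time; damage caused by a program error may not be.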
rdiff-backup
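We recommend rdiff-backup here. It not only does incremental backups but also keeps the state of every backup, that is, the differences between each new backup and the previous one, so it is easy to go back to an earlier version. The only requirement is that the local and remote servers have the same version of rdiff-backup installed. Two other tools, duplicity and rsnapshot, do similar work, but they use different methods and take up different amounts of disk space; see the comparisons linked in the References.
Installing and using rdiff-backup works as follows (these notes were originally written in English):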
- Install rdiff-backup at both local and remote computers
- #install for Ubuntu, debian
- sudo apt-get install python-dev librsync-dev
- #self compile
- #download librsync from https://sourceforge.NET/project/showfiles.PHP?group_id=56125
- tar xvzf librsync-0.9.7.tar.gz
- export CFLAGS="$CFLAGS -fPIC"
- ./configure --prefix=/home/user/rsync --with-pic
- make
- make install
- Install rdiff-backup
- #See Reference part for download link
- # http://www.nongnu.org/rdiff-backup/
- python setup.py install --prefix=/home/user/rdiff-backup
- #If you compiled librsync yourself, please specify the location of librsync
- python setup.py --librsync-dir=/home/user/rsync install --prefix=/home/user/rdiff-backup
- Add executable files and python modules to environmental variables
- #Add the following words into .bashrc or .bash_profile or any other config files
- export PATH=${PATH}:/home/user/rdiff-backup/bin
- export PYTHONPATH=${PYTHONPATH}:/home/user/rdiff-backup/lib/python2.x/site-packages
- #pay attention to the x in python2.x of above line which can be 6 or 7 depending on
- #the Python version used.
- Test environmental variable when executing commands through SSH
- ssh user@host 'echo ${PATH}' #When I ran this command on my local computer,
- #I found that only the system environment variables were used
- #and none of my self-defined environment variables.
- #So I modified the following lines in the file 'SetConnections.py' in
- #/home/user/rdiff-backup/lib/python2.x/site-packages/rdiff_backup
- #to set the environment explicitly at login.
- #Pay attention to the single quotes used inside the double quotes
- __cmd_schema = "ssh -C %s 'source ~/.bash_profile; rdiff-backup --server'"
- __cmd_schema_no_compress = "ssh %s 'source ~/.bash_profile; rdiff-backup --server'"
- #Choose whichever of .bash_profile and .bashrc contains the environment variables for rdiff-backup.
- Use rdiff-backup
- Start backup
- rdiff-backup --no-compression --print-statistics user@host::/home/user/source_dir destination_dir
- If the destination_dir exists, please add --force like rdiff-backup --no-compression --force --print-statistics user@host::/home/user/source_dir destination_dir. Everything in the original destination_dir will be deleted.
- If you want to exclude or include special files or dirs please specify like --exclude '**trash' or --include /home/user/source_dir/important.
- Timely backup your data
- Add the above command into crontab (run 'crontab -e' in a terminal to open crontab) in a format like 5 22 */1 * * command, which means executing the command at 22:05 every day.
- Restore data
- Restore the latest data by running rdiff-backup -r now destination_dir user@host::/home/user/source_dir.restore. Add --force if you want to restore to source_dir.
- Restore files 10 days ago by running rdiff-backup -r 10D destination_dir user@host::/home/user/source_dir.restore. Other acceptable time formats include 5m4s (5 minutes 4 seconds) and 2014-01-01 (January 1st, 2014).
- Restore files from an increment file by running rdiff-backup destination_dir/rdiff-backup-data/increments/server_add.2014-02-21T09:22:45+08:00.missing user@host::/home/user/source_dir.restore/server_add. Increment files are stored in destination_dir/rdiff-backup-data/increments/server_add.2014-02-21T09:22:45+08:00.missing.
- Remove older records to save space
- Delete all information about file versions that have not been current for 2 weeks by running rdiff-backup --remove-older-than 2W --force destination_dir. Note that an existing file which has not changed for a year will still be preserved, but a file which was deleted 15 days ago cannot be restored after this command. Normally one should use --force, since it is needed to delete multiple increments at the same time, which --remove-older-than refuses to do by default.
- Keep only the last 20 rdiff-backup sessions by running rdiff-backup --remove-older-than 20B --force destination_dir.
- Statistics
- List increments in a given folder with rdiff-backup --list-increments destination_dir/.
- List files changed in the last 5 days with rdiff-backup --list-changed-since 5D destination_dir/.
- Compare the difference between source and bak with rdiff-backup --compare user@host::source-dir destination_dir.
- Compare the difference between source and bak (as it was two weeks ago) with rdiff-backup --compare-at-time 2W user@host::source-dir destination_dir.
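The backup, list and restore steps above can be tried end to end on local directories before pointing them at a server. This is only a sketch: it assumes the classic rdiff-backup command-line syntax used in these notes (newer releases may print deprecation warnings for it), all directory names are throwaway, and the demo is skipped when rdiff-backup is not installed.

```shell
#!/bin/bash
# Local dry run of the backup / list / restore cycle on temporary directories.
work=$(mktemp -d)
mkdir -p "$work/source_dir"
echo "v1" > "$work/source_dir/notes.txt"

if command -v rdiff-backup >/dev/null 2>&1; then
    # Back up source_dir into destination_dir (created on first run)
    rdiff-backup "$work/source_dir" "$work/destination_dir"
    # Show the backup sessions recorded so far
    rdiff-backup --list-increments "$work/destination_dir"
    # Restore the latest state into a new directory
    rdiff-backup -r now "$work/destination_dir" "$work/source_dir.restore"
    cat "$work/source_dir.restore/notes.txt"
else
    echo "rdiff-backup not installed; skipping demo"
fi
```

Restoring into a separate source_dir.restore directory, as in the notes above, keeps the original files untouched until the restored copy has been checked.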
- A complete script (automatically sync using crontab)
- #!/bin/bash
- export PYTHONPATH=${PYTHONPATH}:/soft/rdiff_backup/lib/python2.7/site-packages/
- rdiff-backup --no-compression -v5 --exclude '**trash' user@server::source/ bak_dir/
- ret=$?
- if test $ret -ne 0; then
- echo "Wrong in bak" | mutt -s "Wrong in bak" bak@mail.com
- else
- echo "Right in bak" | mutt -s "Right in bak" bak@mail.com
- fi
- echo "Finish rdiff-backup $0 ---`date`---" >>bak.log 2>&1
- echo "`rdiff-backup --exclude '**trash' --compare-at-time 1D user@server::source/ bak_dir/`" | mutt -s "Lists of backed-up files" bak@mail.com
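The success and failure branches of the script above differ only in the message they send, so the exit-status check can be factored into a small function and tried without a mail client. This is a sketch under that assumption; `backup_status` is a hypothetical name, and `true` and `false` stand in for the real rdiff-backup call.

```shell
#!/bin/bash
# Hypothetical helper: run a command and print a one-line status that can
# be used as both the mail subject and the mail body.
backup_status() {
    if "$@" >/dev/null 2>&1; then
        echo "Right in bak"
    else
        echo "Wrong in bak"
    fi
}

# Stand-ins for the real backup command
ok_subject=$(backup_status true)
bad_subject=$(backup_status false)
echo "$ok_subject / $bad_subject"

# In the real script this could become (not executed here):
#   subject=$(backup_status rdiff-backup --no-compression --exclude '**trash' user@server::source/ bak_dir/)
#   echo "$subject" | mutt -s "$subject" bak@mail.com
```

Note that the helper discards the command's own output; redirecting it to a log file instead would keep the details available when a backup fails.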
- References
- rdiff-backup
- duplicity
- rsnapshot
- http://www.saltycrane.com/blog/2008/02/backup-on-Linux-rsnapshot-vs-rdiff/
- http://james.lab6.com/2008/07/09/rdiff-backup-and-duplicity/
- http://bitflop.com/document/75
- http://askubuntu.com/questions/2596/comparison-of-backup-tools
- http://www.reddit.com/r/Linux/comments/fgmbb/rdiffbackup_duplicity_or_rsnapshot_which_is/
- http://serverfault.com/questions/491341/optimize-space-rdiff-backup
- Another great post on usage of rdiff-backup
Source: http://server.51cto.com/sOS-583662.htm