Jul 16, 2011

CanVec Batch Downloader (Bash Script)

Summary
I used to use Filezilla (and still do actually, but not for this...) to batch download CanVec GIS data.

Recently I put together a new method using a bash script. It's the first of two scripts; the second uses ogr2ogr to batch merge the CanVec data into shapefiles.

The batch download script takes a list of NTS sheets in CSV format, makes you a workspace, and begins downloading the CanVec data (or Geobase DEMs). I'll also cover the method I use to format that list from a DBF to get it ready for this process.

It basically involves stripping the DBF of its header and reducing it to a single column with no special characters.

The formatting is all done in OpenOffice.org Base.
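
(As an aside: if you already have GDAL installed, the same single-column list can probably be built entirely on the command line. A quick sketch, untested against my exact DBF, that naively assumes the sheet numbers sit in the first column and contain no commas:)

# Convert the DBF to CSV with ogr2ogr, then keep only the
# first column and drop the header row.
ogr2ogr -f CSV 50k_aoi_full.csv 50k_aoi.dbf
cut -d',' -f1 50k_aoi_full.csv | tail -n +2 > 50k_aoi.csv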




Converting a DBF to a CSV with OpenOffice Base
First I had to install OpenOffice.org Base using apt-get:
sudo apt-get update
sudo apt-get install openoffice.org-base
This also installed the Java Runtime Environment (which I didn't know I didn't have).

Apparently Ubuntu comes with an open-source Java that seems to do the trick; OpenOffice.org, unlike LibreOffice, asks strictly for the Java Runtime Environment, but it still works with the open-source Java.

Once that was all installed I could then open my DBF with Base from the terminal:
oobase GIS/indiGenIS/shp/base/50k_aoi.dbf

It immediately asked me for the Character Set.

The default did work, but I chose UTF-8 anyway because it without a doubt supports the characters in this 50k NTS list.


ooBase default Charset highlighted.
If the SHP came with a .cpg file, the character encoding of the DBF will already be known.

When the DBF opened it contained two columns and a header for the column names:

DBF file in OpenOffice Base showing 50k NTS Sheets

I removed the second column [B] and the first row [1], leaving only a list of 50k NTS sheets:

To do this, I right-clicked column [B] or row [1] and selected Delete Rows/Columns, not Delete Contents.

It was then time to go to File > Save As...
At the bottom of the Save window that comes up, change the file type to CSV using either the pull-down menu or the expandable menu (+).

After I gave it a new filename and placed it in a folder of my choice I clicked Save/OK and this window popped up:

I selected "Keep Current Format" to continue.

After I pressed the "Keep Current Format" button I was able to adjust the CSV export settings, like the delimiters and the character set (which I already specified at the beginning, so no need to change it here):

Default CSV export settings dialog.

I removed the Text and Field delimiters, because those would be considered invalid characters in the script and would cause it to bail, then pressed OK.

Field and Text delimiters removed.

To verify that I did everything correctly I checked the contents of the CSV file with the command-line utilities cat and head:
cat GIS/indiGenIS/tables/50k_aoi.csv | head

Using cat and head to check a portion of the CSV file contents.
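
To catch problems before running the script, a couple of greps will flag leftover delimiters or blank lines. This isn't from the script itself, just a quick sanity check:

# Flag leftover quote/comma delimiters or blank lines;
# no output means the CSV is clean.
grep -n '[",]' GIS/indiGenIS/tables/50k_aoi.csv
grep -n '^$' GIS/indiGenIS/tables/50k_aoi.csv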



Bash Scripting and the CanVec Batch Downloader
The script is fairly straightforward.

It asks you for some information before beginning:
(a) Your e-mail? It is used as the FTP password; the username is 'anonymous'.
(b) Where is your CSV file saved? The file is checked for errors (minimal).
(*b) The CSV file can be a list of either 250k or 50k sheets; it doesn't matter.
(c) A workspace to save the zip files in? If it doesn't exist it will be created.
(d) What dataset? Options currently include CanVec 50k or Geobase DEMs 50/250k.

The script itself is not entirely finished, because I am still adding some 'feedback' for when it finishes downloading, like total download time and file size/count.

The script will still get me my zip files and place them nicely in a folder for me.
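
To give a feel for what it does, here is a stripped-down sketch of the download loop. This is not the script itself: the variable names are made up, and the FTP layout is taken from the wget log in the update below.

#!/bin/bash
# Sketch of the CanVec download loop (hypothetical names).
# Reads one 50k NTS sheet per line (e.g. 103j12) and fetches
# the matching zip from the NRCan anonymous FTP server.
read -p "Path to your CSV of NTS sheets: " csv
read -p "Workspace to save the zips in: " workspace
mkdir -p "$workspace"
while read -r sheet; do
    block=${sheet:0:3}     # e.g. 103
    letter=${sheet:3:1}    # e.g. j
    wget -P "$workspace" "ftp://ftp2.cits.rncan.gc.ca/pub/canvec/50k_shp/$block/$letter/canvec_${sheet}_shp.zip"
done < "$csv"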

Download: CanDataFTP_v1.sh 9.6kb

I save my scripts in a folder called 'script' in my HOME directory.

Be sure to set the permissions (make it executable) prior to running it, or it won't execute:
chmod a+x script/CanDataFTP_v1.sh
./script/CanDataFTP_v1.sh
In the above it is OK to omit the 'a' in 'a+x' so that it becomes just '+x'.

NOTE: The script will create a .wgetrc file in your home folder to hold the username and password, overwriting yours if one already exists!
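
For reference, a .wgetrc for anonymous FTP only needs a couple of lines; something along these lines (the script's actual file may differ):

# ~/.wgetrc -- a sketch; user and password are standard wgetrc settings
user = anonymous
password = you@example.com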

I ought to mention that I have never made a shell script before; everything I learned came from online resources.

It took me under half a day to put this script together but nearly a month to cover the required material... I think it was worth it, as this is a task I do repeatedly and this will streamline that process.

Enjoy.

Next up is the CanVec_shp script, which batch merges the CanVec GIS data into separate shapefiles using ogr2ogr from GDAL.



Results [UPDATE]
When I did my selection of NTS sheets to see what CanVec tiles I needed, it grabbed 961 50k NTS sheets.

I ran this formatted table (as a CSV) through the script and it finished saying:
Downloaded: 960 files, 3.0G in 1h 20m 5s (664 KB/s)
It says it is missing one!

So I ran the script again, but this time adding "--no-clobber" to the wget portion so that existing files would NOT be overwritten.
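
In wget terms that is just one extra option on the existing call, something like this (a sketch, not the exact line from the script):

wget --no-clobber -P "$workspace" "ftp://ftp2.cits.rncan.gc.ca/pub/canvec/50k_shp/$block/$letter/canvec_${sheet}_shp.zip"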

If they existed in my workspace, they would be skipped:
File `canvec_103j10_shp.zip' already there; not retrieving.
File `canvec_103j11_shp.zip' already there; not retrieving.
--2011-07-16 09:28:26-- ftp://ftp2.cits.rncan.gc.ca/pub/canvec/50k_shp/103/j/canvec_103j12_shp.zip
=> `canvec_103j12_shp.zip'
Resolving ftp2.cits.rncan.gc.ca... 192.67.45.79
Connecting to ftp2.cits.rncan.gc.ca|192.67.45.79|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /pub/canvec/50k_shp/103/j ... done.
==> SIZE canvec_103j12_shp.zip ... done.
==> PASV ... done. ==> RETR canvec_103j12_shp.zip ...
No such file `canvec_103j12_shp.zip'.

File `canvec_103j15_shp.zip' already there; not retrieving.
File `canvec_103j16_shp.zip' already there; not retrieving.
It could not find the CanVec sheet for 103j12!

I headed over to the CanVec 8th Edition Datasets list (in txt format) and did a 'find' (Ctrl+F) for that specific sheet; nothing was returned. Sure enough, it doesn't exist!

Missing NTS sheet for CanVec download is highlighted in Yellow.
It sits in the ocean, maybe a flooded island?


4 comments:

  1. Be careful using wget in sessions that are far apart in time. I used to use a similar process on Windows (http://code.google.com/p/maphew/source/browse/gis/canvec/download-yukon-gml.bat). I stopped after encountering hundreds of megabytes of corrupted zip files.

    It turns out that "wget --continue" doesn't compare timestamps, so when I ran my script in the spring, and then again in the fall after NRCAN had updated some of the distribution files, wget simply appended the new data to the old, and thus corrupted the zips.

    Now I'm using Filezilla or WinSCP and using their synchronization features, as they give more flexibility and control over what to do when the size or time doesn't match. Save Filezilla's download queue (before it completes any files) as XML and then load it next year for re-use. It's not quite as fire-and-forget as a script but pretty darn close.

  2. Good points, and I came across your post as well (http://www.maphew.com/Projects/Using_CanVec) when I was researching this batch process. I needed something in bash because I am using Ubuntu, not Windows, for this.

    I have been learning more about wget and have updated lots of my script (not posted yet).

    Using the -c (--continue) option can cause local files to become corrupted; you're absolutely right! I read the wget man page a little closer... Two possible scenarios:
    (a) if the server file is smaller than the local file, it may be rejected altogether;
    (b) if the server file is larger than the local file, even though the header dates will be different, wget may try to append the extra data to the local file without actually replacing it.

    By using -N (--timestamping) without -c or -nc (--no-clobber), I have updated the script (not posted yet) so that if it sees a change in either the file size or the last-modified date, the file is re-downloaded from scratch. I am going to consider this a small price to pay to maintain local file integrity!
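
    In practice that just means swapping the option on the wget call, e.g. (a sketch):
    wget -N ftp://ftp2.cits.rncan.gc.ca/pub/canvec/50k_shp/103/j/canvec_103j10_shp.zip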

    I like that Filezilla is available on Linux, but I was having some trouble too: I was getting corrupted local zip files when it would try to 'continue' them as well (not all of them though...), so I'm not sure why.

    I had never heard of or used WinSCP; I will check it out, although it sounds like a Windows thing by the name of it. Other command-line utilities that I have tried with great success are axel and especially aria2c, both open source as well. They help with my limited satellite internet connection at 6-60 KB/s. I don't get broadband speeds anymore, though I am working on fixing that as well...

  3. I remember what my problem with Filezilla was! My satellite internet connection will often die or cut out for seconds at a time, but then it kicks back in.

    When it would kick back in, Filezilla would finish the download, say it was complete, and move on to the next file instead of retrying the connection. When I went to 'continue' these files they would just be skipped, as though they were finished, even though their sizes were different. Maybe a bug?

    I use wget now because it will just sit there and wait instead of giving a faulty "file finished".

  4. Yes, WinSCP is for Windows, though I've read of Linux people using it happily under Wine. I also tried Unison for a time; it certainly has the right feature list and is truly multi-platform, but I just couldn't wrap my head around getting it configured properly. I encourage you to contact NRCan and ask them to add an rsync server to their distribution methods (I do every year). Even torrents would be an improvement. There was a movement a while ago to publish geo data via torrents but it didn't really grow legs.
