As part of an automated process, I have a service that recursively iterates through an FTP directory and pulls down any new or changed files, saving them to disk before uploading them to a CDN (content delivery network), where they are used on the Web. Another automated process digitizes published magazines and writes the images and text as XML documents into this FTP directory so they can be consumed.
This had been running for some time when I noticed that a lot of images were not being copied down from the FTP server. After some investigation, I found that only images with spaces in the name were failing to download, and upon looking into my code it was pretty clear what was going wrong.
Processing the LIST Command Result
I was using the WebRequestMethods.Ftp.ListDirectoryDetails flag, which makes the FtpWebRequest object use the FTP LIST command to retrieve the file names in each directory. The FTP server is a Unix-based machine, so when I execute the LIST command it returns a CRLF-delimited blob of text containing one record per entry in the directory. Each record looks like the following:
-rw-r--r-- 1 1089 1091 505482 Nov 19 22:53 paper texture 2jpg1290206009609589.JPG
This breaks down into the following structure, with fields delimited by spaces:
| mode | links | owner | group | size | datetime | name |
| -rw-r--r-- | 1 | 1089 | 1091 | 505482 | Nov 19 22:53 | paper texture 2jpg1290206009609589.JPG |
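To make the request side concrete, here is a minimal sketch of issuing the LIST command and splitting the response into records. The class and method names (`FtpListing`, `FetchRawListing`, `SplitRecords`) are made up for illustration, and the URI and credentials are whatever the caller supplies; this is not the service's actual code.

```csharp
using System;
using System.IO;
using System.Net;

// Hypothetical helper, for illustration only.
public static class FtpListing
{
    // Sends LIST for the given directory and returns the raw response text.
    public static string FetchRawListing(Uri directoryUri, NetworkCredential credentials)
    {
        var request = (FtpWebRequest)WebRequest.Create(directoryUri);
        request.Method = WebRequestMethods.Ftp.ListDirectoryDetails; // FTP LIST
        request.Credentials = credentials;

        using (var response = (FtpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    // Splits the line-delimited blob into one string per record,
    // dropping the empty entry left by the trailing line break.
    public static string[] SplitRecords(string rawListing)
    {
        return rawListing.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

Splitting on both "\r\n" and "\n" is a defensive choice in this sketch, in case a server sends bare line feeds.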
Note that there is another FTP command, NLST, which returns just the file name itself. It works just fine, but it doesn’t give you enough information about what kind of entry you are looking at. For example, if I encounter a folder, I want to step into it and read its contents. If it’s a file, I want to download it. The mode portion of the LIST result gives you the information required to make this decision: if the first character of the mode is a “d”, it’s a directory, so keep traversing; otherwise assume it’s a file and download it.
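The directory-or-file rule can be sketched in a few lines. The names `FtpEntryKind` and `FtpEntryClassifier` are hypothetical, chosen only for this example.

```csharp
using System;

// Hypothetical types, for illustration only.
public enum FtpEntryKind { Directory, File }

public static class FtpEntryClassifier
{
    // Looks only at the first character of the mode field,
    // which is also the first character of the record.
    public static FtpEntryKind Classify(string record)
    {
        if (string.IsNullOrEmpty(record))
            throw new ArgumentException("Empty LIST record.", nameof(record));

        // 'd' marks a directory. Everything else is assumed to be a file,
        // so a symbolic link ('l') would be treated as a file too.
        return record[0] == 'd' ? FtpEntryKind.Directory : FtpEntryKind.File;
    }
}
```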
The Problem with Spaces (and my code)
When I process a LIST record string, I split it into an array and select the name element by ordinal position (the last element):
The problem is that when the name element has spaces, the last element is only the last whole part of the file name; any preceding parts are split into separate elements and ignored. So for most files there were no problems; it wasn’t until files with spaces in the name began appearing that the bug surfaced.
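The original listing is not reproduced here, so what follows is only a minimal sketch of that approach, assuming a plain split on the space character. `NaiveListParser` and `GetName` are hypothetical names.

```csharp
using System;

// Hypothetical reconstruction of the split-and-take-last approach.
public static class NaiveListParser
{
    public static string GetName(string record)
    {
        // Break the record into its space-delimited parts.
        string[] parts = record.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // Assume the name is always the final part.
        return parts[parts.Length - 1];
    }
}
```

For a record whose name is a single word, such as `report.xml`, the last element is the whole name and this works as intended.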
Regular Expressions To The Rescue
A quick Google search revealed that people who have encountered this problem used regexes to correctly parse the LIST result record. In particular, this post on Stack Overflow hit the nail on the head:
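The linked answer is not reproduced here. As a rough sketch of the idea, and not the pattern from that post, a regex can anchor on the fixed fields at the front of the record and let the name capture everything that remains. The pattern, `FtpListEntry`, and `RegexListParser` below are my own assumptions for Unix-style LIST output.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical types, for illustration only.
public sealed class FtpListEntry
{
    public string Mode { get; set; }
    public long Size { get; set; }
    public string Name { get; set; }
    public bool IsDirectory { get { return Mode.Length > 0 && Mode[0] == 'd'; } }
}

public static class RegexListParser
{
    // Assumed layout: mode links owner group size month day (time|year) name.
    // The datetime group allows "22:53" or a year such as "2009".
    // The name group is last and takes the rest of the line, spaces included.
    private static readonly Regex UnixListPattern = new Regex(
        @"^(?<mode>\S+)\s+(?<links>\d+)\s+(?<owner>\S+)\s+(?<group>\S+)\s+(?<size>\d+)\s+" +
        @"(?<datetime>\w{3}\s+\d{1,2}\s+(?:\d{1,2}:\d{2}|\d{4}))\s+(?<name>.+)$");

    // Returns null when the record does not fit the assumed layout.
    public static FtpListEntry Parse(string record)
    {
        Match match = UnixListPattern.Match(record.TrimEnd('\r', '\n'));
        if (!match.Success)
            return null;

        return new FtpListEntry
        {
            Mode = match.Groups["mode"].Value,
            Size = long.Parse(match.Groups["size"].Value, CultureInfo.InvariantCulture),
            Name = match.Groups["name"].Value
        };
    }
}
```

Because the separator before the name is `\s+`, a file name that starts with spaces would lose them; that trade-off seemed acceptable for a sketch.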
A quick unit test confirms the simplest case:
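The original test is not reproduced here. A framework-free sketch of such a check, using the sample record from earlier, might look like the following. The pattern is an assumed one for Unix-style LIST records, and `ListRecordNameTest` is a hypothetical name; in a real project this would live in whatever test framework you use.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical check, for illustration only.
public static class ListRecordNameTest
{
    // Assumed pattern for Unix-style LIST records; the name is the final group.
    private const string Pattern =
        @"^\S+\s+\d+\s+\S+\s+\S+\s+\d+\s+\w{3}\s+\d{1,2}\s+(?:\d{1,2}:\d{2}|\d{4})\s+(?<name>.+)$";

    public static void NameWithSpacesIsParsedWhole()
    {
        const string record =
            "-rw-r--r-- 1 1089 1091 505482 Nov 19 22:53 paper texture 2jpg1290206009609589.JPG";

        Match match = Regex.Match(record, Pattern);
        if (!match.Success)
            throw new Exception("Record did not match the pattern.");

        // The whole name, spaces included, should come back as one value.
        string name = match.Groups["name"].Value;
        if (name != "paper texture 2jpg1290206009609589.JPG")
            throw new Exception("Unexpected name: " + name);
    }
}
```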