ILLEGAL CHARACTERS in Filenames

June 21, 2011

Many existing Unix/Linux shell scripts presume that filenames contain no space characters, and the default setting of the Bourne shell “IFS” variable splits on spaces as well.
[replace the internal blanks with hyphens; see command below]

Pathname components are separated by “/”; therefore, filenames cannot contain “/”. Neither filenames nor pathnames can contain the ASCII NUL character (\0), because that is the string terminator.

It has been said “If you’re designing a wrench, don’t put razor blades on the handles.”

Some operating systems, like Plan 9, expressly forbid spaces in filenames.

For these reasons and more:
Only start and end with letters and numbers, and stay with ASCII!
Certainly DON’T START with these 3: - ~ #   (and let only your text editor add them on to the end).
Forbid !@$&*()?:[]"<>'`|={}\/,; and spaces!     Only internally allow -~#%^_+
Forbid ASCII control characters (bytes 1-31 and 127), especially newline, escape, and tab.
Forbid raw non-ASCII bytes 128-255! (Use UTF-8.)
Forbid filenames that aren’t a valid UTF-8 encoding (UTF-8 includes ASCII).
Forbid the space character (and any invisible character!). Search and replace them with “-” or “_”.
Note: # and ~ are used by editors to indicate backup and saved copies.

For examples of just how bad the nightmare can get if you don’t clean up your file names, see
http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html

As much as possible, stay within the “Portable Filename Character Set” (defined in section 3.276 of the POSIX base definitions); this turns out to be just A-Z, a-z, 0-9, period, underscore, and hyphen (aka the dash character).
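The check described above can be sketched as a small POSIX shell function. This is only an illustration: the name `is_portable_name` is made up for this example, and the extra rejection of a leading hyphen follows the advice elsewhere in this post rather than the character set itself.

```shell
#!/bin/sh
# is_portable_name NAME  (hypothetical helper, not part of any standard tool)
# Succeeds if NAME is non-empty, uses only A-Z, a-z, 0-9, period,
# underscore and hyphen, and does not begin with a hyphen.
is_portable_name() (
  LC_ALL=C                          # make the A-Z / a-z ranges plain ASCII
  case $1 in
    '' | -*) exit 1 ;;              # empty name, or leading hyphen
    *[!A-Za-z0-9._-]*) exit 1 ;;    # some character outside the set
  esac
  exit 0
)

# Example use: report each argument that is not portable.
for name in "$@"; do
  is_portable_name "$name" || printf 'not portable: %s\n' "$name"
done
```

The function body runs in a subshell, so the locale change does not leak into the rest of the script.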


Find and Replace all blanks in file names

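The first command finds all files with one or more blanks in their names, in the current directory and in all subdirectories. The next three find a few more unsavory characters (single quote, comma, and ampersand). Quote the patterns so that the shell passes them to find unchanged.

find . -name "* *"

find . -name "*'*"
find . -name '*,*'
find . -name '*&*'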

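Replace the first (only?) blank in all file names. This is the util-linux rename syntax shipped with Fedora; the Perl-based rename found on Debian-style systems takes an expression such as 's/ /-/' instead.
rename ' ' - ./* # works only in the current dir.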

The first line below replaces the first blank in the name of every folder and file, in and under the current folder, with a hyphen. The second line removes the first single quote, the third removes the first comma, and the fourth replaces the first & with “-n-”, which shows that you can substitute more than one character.

 find . -name  '*' -exec rename  ' '  -   {} \;

 find . -name  '*' -exec rename  \'  ''  {} \;
 find . -name  '*' -exec rename  \,  ''  {} \;
 find . -name  '*' -exec rename  \&  -n- {} \;

Note: Do not globally replace or remove blanks in your home folder. Several applications (e.g., google_chrome and thunderbird) keep files and folders with blanks in their names there, and you will lose your personal data.
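As a more cautious alternative to the rename commands above, here is a minimal sketch that uses only find and mv. It is not from the original post: the function name `replace_blanks` is hypothetical, it replaces every blank (not just the first) in each name with a hyphen, and it refuses to overwrite an existing file. You must name the directory explicitly, which makes it harder to run against your home folder by accident.

```shell
#!/bin/sh
# replace_blanks DIR  (hypothetical helper)
# Rename every entry under DIR whose name contains a space, replacing each
# space with a hyphen. Entries whose new name already exists are skipped.
replace_blanks() {
  # -depth visits a directory's contents before the directory itself, so
  # renaming a parent never invalidates paths that are still to be processed.
  find "$1" -depth -name '* *' -exec sh -c '
    for path do
      case $path in */*) ;; *) continue ;; esac  # leave a bare start dir alone
      dir=${path%/*}                             # up to the last slash
      base=${path##*/}                           # final path component
      new=$(printf "%sX" "$base" | tr " " "-")
      new=${new%X}                               # X guards trailing newlines
      if [ -e "$dir/$new" ]; then
        printf "skipped (target exists): %s\n" "$path" >&2
      else
        mv -- "$path" "$dir/$new"
      fi
    done
  ' sh {} +
}
```

Typical use would be `replace_blanks ./Music`, after first listing what will change with `find ./Music -name '* *'`.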

Detox
A Fedora/Linux “Utility to replace problematic characters in file names”
“Detox is a utility designed to clean up file names. It replaces difficult to work with characters, such as spaces, with standard equivalents. It will also clean up file names with UTF-8 or Latin-1 (or CP-1252) characters in them.”


Windows

Also, in Windows, \ and / are both interpreted as directory name separators, and there are issues with the characters " . [ ] ; = and ,

Microsoft Windows also makes some terrible mistakes with its filesystem naming: it has very arbitrary interpretations of filenames, which can make it dangerous.

For example, if there is a directory called “c:\temp”, and you run the following commands from Windows’ “cmd”:

mkdir c:\temp
echo hi > c:\temp\Com1.txt

You might think that this sequence creates a file named “c:\temp\Com1.txt”. You would be wrong; it doesn’t create a file at all. Instead, this writes the text to the serial port. In fact, there are a vast number of special filenames, and even extensions don’t help. Since filenames are often generated from attacker data, this can be a real problem. I’ve confirmed this example with Windows XP, but I believe it’s true for many versions of Windows.
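If filenames created on Unix may later land on a Windows machine, a script can screen for these device names up front. The sketch below is an illustration only: the function name `is_windows_reserved` is made up, and the list of names (CON, PRN, AUX, NUL, COM1-COM9, LPT1-LPT9) comes from Microsoft's file-naming documentation rather than from this article.

```shell
#!/bin/sh
# is_windows_reserved NAME  (hypothetical helper)
# Succeeds if NAME, ignoring letter case and any extension, is one of the
# reserved DOS device names.
is_windows_reserved() {
  stem=${1%%.*}                                  # drop from the first dot on
  stem=$(printf '%s' "$stem" | tr 'a-z' 'A-Z')   # compare case-insensitively
  case $stem in
    CON | PRN | AUX | NUL | COM[1-9] | LPT[1-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Example use: warn about each argument that would hit a device on Windows.
for name in "$@"; do
  if is_windows_reserved "$name"; then
    printf 'reserved on Windows: %s\n' "$name"
  fi
done
```

Because the extension is stripped before the comparison, “Com1.txt” from the example above is flagged just like “COM1”.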

If it’s too hard to write good examples of easy tasks that do the job correctly, then the system is making it too hard to do the job correctly!


Control characters

Decimal   Hexadecimal   Name                                 Gambas character
-----------------------------------------------------------------------------
0         00            NUL                                  '\0' (1)
1         01            SOH    Start of Header
2         02            STX    Start of Text
3         03            ETX    End of Text
4         04            EOT    End of Transmission
5         05            ENQ    Enquiry
6         06            ACK    (positive) Acknowledgement
7         07            BEL    Audible Signal (Bell or Beep) '\a' (1)
8         08            BS     Backspace                     '\b' (1)
9         09            HT     Horizontal Tab                '\t'
10        0A            LF     Line Feed                     '\n'
11        0B            VT     Vertical Tab                  '\v' (1)
12        0C            FF     Form Feed                     '\f' (1)
13        0D            CR     Carriage Return               '\r'
14        0E            SO     Shift Out
15        0F            SI     Shift In
16        10            DLE    Data Link Escape
17        11            DC1    Device Control 1
18        12            DC2    Device Control 2
19        13            DC3    Device Control 3
20        14            DC4    Device Control 4
21        15            NAK    negative Acknowledgement
22        16            SYN    Synchronise
23        17            ETB    End of Transmission Block
24        18            CAN    Cancel
25        19            EM     End Of Medium
26        1A            SUB    Substitute
27        1B            ESC    Escape
28        1C            FS     File Separator
29        1D            GS     Group Separator
30        1E            RS     Record Separator
31        1F            US     Unit Separator

Meta-characters
Characters that must be escaped in a shell before they can be used as an ordinary character are termed “shell meta-characters”.

Forbid the glob characters ( *, ?, and [ ): this can eliminate many errors due to forgetting to double-quote a variable reference in the Bourne shell.
Forbid the HTML special characters “<”, “>”, “&”, and “"”: this will eliminate many errors caused by incorrectly escaping filenames.
Forbid the backslash character: this will eliminate requiring the -r option of Bourne shell read.
Finally, forbid shell meta-characters, which eliminates having to escape them in many circumstances.

If filenames never had characters that needed to be escaped, there’d be one less operation that could fail – in many situations.

A useful starting-point list is “ *?:[]"<>|(){}&'!\; ”. The colon causes trouble with Windows and MacOS systems, and causes problems in Linux/Unix because it’s a directory separator in many directory or file lists (including PATH, bash CDPATH, gcc COMPILER_PATH, and gcc LIBRARY_PATH), and it has a special meaning in a URL/URI. Note that & is on the list.
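To see which of your own filenames would be caught by that starting-point list, a test along the following lines could be used. It is a sketch only, and the function name `has_metachar` is invented for this example. Each character is written as its own quoted alternative, which avoids the subtle quoting rules of bracket expressions.

```shell
#!/bin/sh
# has_metachar NAME  (hypothetical helper)
# Succeeds if NAME contains a space or any character from the
# starting-point list:  * ? : [ ] " < > | ( ) { } & ' ! \ ;
has_metachar() {
  case $1 in
    *' '* | *'*'* | *'?'* | *':'* | *'['* | *']'* | *'"'* | *'<'* | *'>'* | \
    *'|'* | *'('* | *')'* | *'{'* | *'}'* | *'&'* | *"'"* | *'!'* | *'\'* | *';'*)
      return 0 ;;
    *) return 1 ;;
  esac
}

# Example use: list entries of the current directory that need attention.
for file in ./*; do
  if [ -e "$file" ] && has_metachar "${file#./}"; then
    printf '%s\n' "$file"
  fi
done
```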

The Unix-haters handbook page 167 (PDF page 205) begins Jamie Zawinski’s multi-page description of his frustrated 1992 effort to simply “find all .el files in a directory tree that didn’t have a corresponding .elc file. That should be easy.”

Here’s David Wheeler’s complex but successful solution, which lives with the messy reality. As he says, “such powerful grenade-like techniques should not be necessary!”:

 
  IFS="`printf '\n\t'`"
  for file in `find . -name '*.el'` ; do
    if [ ! -f "${file}c" ] ; then
      echo "$file"
    fi
  done

This approach (above) just sets IFS to the value it should normally have anyway, followed by a single bog-standard loop over the result of “find”. This actually handles the entire tree as Zawinski wanted, and it handles spaces-in-filenames correctly. It also handles empty directories, and it handles meta-characters in filenames. If we also required that filenames be UTF-8, then we could be certain that the displayed characters would be sensible. This particular program works even when file components begin with “-”, because “find” will prefix the filenames with “./”, but preventing such filenames is still imperative for many other programs (the call to echo would fail and possibly be dangerous if the filename had been acquired via a glob like *).

My approach also avoids piping its results to another shell to run. There’s nothing wrong with having a shell run a program generated by another program (it’s a powerful technique), but if you use this technique, small errors can have catastrophic effects (in Zawinski’s example, a filename with meta-characters could cause disaster). So it’s best to use the “run generated code” approach only when necessary. This is a trivial problem; such powerful grenade-like techniques should not be necessary! Most importantly, it’s easy to generalize this approach to arbitrary file processing.
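For comparison, the same task can also be written so that the shell never splits the output of find at all. The sketch below is not Wheeler's solution; it is an alternative that uses the standard `find ... -exec sh -c '...' sh {} +` idiom, and the function name `list_uncompiled` is made up for this example.

```shell
#!/bin/sh
# list_uncompiled DIR  (hypothetical helper)
# Print each *.el file under DIR that has no matching *.elc file beside it.
list_uncompiled() {
  # find hands the pathnames to the inner shell as separate arguments, so no
  # IFS splitting takes place and any byte except NUL and "/" is harmless.
  find "$1" -type f -name '*.el' -exec sh -c '
    for file do
      [ -f "${file}c" ] || printf "%s\n" "$file"
    done
  ' sh {} +
}

list_uncompiled .
```

The output is still one name per line, so a consumer of that output would be confused by a filename containing a newline, which is one more argument for forbidding such names.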

The point: Adding small limits to filenames makes it much easier to create completely-correct programs. Especially since most software developers act as if these limitations were already being enforced.

The problem is that currently there’s no mechanism for enforcing any policy. Yet it’s often easy for someone to create filenames that trigger file-processing errors in others’ programs (including system programs), leading to foul-ups and hacker exploits. Let administrators determine policies like which [characters] must never occur in filenames, which [characters] must not be prefixes, which [characters] must not be suffixes, and whether or not to enforce UTF-8. All that’s needed in the kernel is a mechanism to enforce such a policy. After all, the problem is so bad that there are programs like detox and Glindra to fix bad filenames.

So what steps could be taken?

Merely forbidding the creation of bad filenames might be enough for a lot of purposes. On the other hand, if you also hide any such filenames that do exist, you have a complete solution: applications on that system can then trust that such “bad” filenames do not exist, and thus hiding such files essentially treats bad filenames like data corruption. I think that if you hide files with “bad” filenames, then you should reject all requests to open a bad filename, whether you’re creating it or not. Administrators would determine how bad filenames should be viewed if they are already there (e.g., in directories): as-is, hidden (not viewed at all), or escaped (see the next point). Another setting would determine whether a file can be opened when its bad filename is used to open it (yes or no); obviously this would only have effect if bad filenames had been created in the first place. There would also be the issue of escaped filenames; if there is a fixed escaping mechanism, you configure which file wins if the escaped name equals the name of another file.

If bad filenames cannot be viewed (because they are escaped or hidden), then you have a complete solution. That is, at that point, all application programs that assumed that filenames are reasonable will suddenly work correctly in all cases. At least on that system, bad filenames can no longer cause mysterious problems and bugs.

Few people really believe that filenames should have this junk, and you can prove that just by observing their actions. Their programs, when you read them, are littered with assumptions that filenames are “reasonable”. They assume that newlines and tabs aren’t in filenames, that filenames don’t start with “-”, that you can meaningfully and safely print filenames, and so on.

1. Forbid/escape ASCII control characters (bytes 1-31 and 127) in filenames, including newline, escape, and tab. I know of no user or program that actually requires this capability. As far as I can tell, this capability exists only to make it hard to write correct software, to ease the job of attackers, and to create interoperability problems. Chuck it.

2. Forbid a leading “-”. This way, you can always distinguish option flags from filenames, eliminating a host of stupid errors. Nobody in their right mind writes programs that depend on having dash-prefixed files on a Unix system. Even on Windows systems they’re a bad idea, because many programs use “-” instead of “/” to identify options.

3. Forbid filenames that aren’t a valid UTF-8 encoding. This way, filenames can always be correctly displayed. Trying to use environment values like LC_ALL (or other LC_* values) or LANG is just a hack that often fails. This will take time, as people slowly transition and minor tool problems get fixed, but I believe that transition is already well underway.

4. Forbid space characters. These confuse users when they happen, with no utility. In particular, filenames that are only space characters are nothing but trouble. I doubt this’ll be acceptable everywhere, but it should be an option.

5. Forbid “problematic” characters that get specially interpreted by shells, other interpreters (such as perl), and HTML. This is less important, and I would expect this to happen (at most) on specific systems. Forbidding “<” and “>” would eliminate a source of nasty errors for perl programs, web applications, and anyone using HTML or XML. A more stringent list would be “ *?:[]"<>|(){}&'!\; ” (this is Glindra’s “safe” list with ampersand, single-quote, bang, backslash, and semicolon added). If this set can be determined locally, based on local requirements, there’s less need to get complete agreement on a list.

6. Forbid leading “~” (tilde). Shells specially interpret such filenames. (Trailing ~’s are used by some editors for backup copies.)
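Until a kernel enforces such a policy, you can at least audit an existing tree against some of these rules. The following is a rough sketch under stated assumptions: the function name `audit_names` is hypothetical, it covers only rules 1, 2, 4 and 6, and the control-character pattern is the same construction used in Wheeler's templates elsewhere in this post.

```shell
#!/bin/sh
# audit_names DIR  (hypothetical helper)
# Print "reason<TAB>path" for each name under DIR that breaks rule 1, 2, 4
# or 6 above. Rules 3 (UTF-8) and 5 (problematic characters) are not checked.
audit_names() {
  # Glob that matches any name containing a byte in 1-31 or 127.
  ctl=$(printf '*[\001-\037\177]*')

  # Rule 1: control characters (C locale so the byte range is taken literally).
  LC_ALL=C find "$1" -name "$ctl" -exec printf 'control-char\t%s\n' {} \;
  # Rule 2: leading dash.
  find "$1" -name '-*' -exec printf 'leading-dash\t%s\n' {} \;
  # Rule 4: space characters.
  find "$1" -name '* *' -exec printf 'space\t%s\n' {} \;
  # Rule 6: leading tilde.
  find "$1" -name '~*' -exec printf 'leading-tilde\t%s\n' {} \;
}
```

Running `audit_names .` before a cleanup gives a list you can review by hand. A name that breaks several rules appears once per rule.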

– David Wheeler

www.dwheeler.com/essays/fixing-unix-linux-filenames.html


www.dwheeler.com/essays/filenames-in-shell.html
David Wheeler:

Filenames and Pathnames in Shell: How to do it correctly

OR, clean up the filenames so that your script will work!

Most developers and users of Bourne shells (including bash, dash, ash, and ksh) assume nearly all the restrictions above.

Even good textbooks on shell programming, and many examples in the POSIX standard, assume the above constraints. Thus, many shell scripts are buggy, leading to surprising failures. These failures are a significant source of security vulnerabilities (see the “Secure Programming for Linux and Unix HOWTO” section on filenames, CERT’s “Secure Coding” item MSC09-C, CWE 78, CWE 73, CWE 116, and the 2009 CWE/SANS Top 25 Most Dangerous Programming Errors).

This little essay explains how to correctly process filenames in Bourne shells. I presume that you already know how to write Bourne shell scripts.

Mistakes

If a filename begins with “-”, it will be misinterpreted as an option instead of as a filename.

If any filename contains a space, newline, or tab, its name will be split and treated as 2 or more files, all non-existent.

If a filename includes “\”, it’ll get corrupted; in particular, if it ends in “\”, it will be combined with the next filename (trashing both).

Note that many of the examples in the POSIX standard xargs section will fail; filenames with spaces, newlines, or many other characters will cause many of the examples to fail.
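The backslash problem is easy to reproduce with the shell's read builtin. The short demonstration below is an illustration added here, not part of Wheeler's essay; the two filenames are made up.

```shell
#!/bin/sh
# Two made-up names, one per line; the first ends in a backslash.
list=$(printf '%s\n' 'dir\' 'next.txt')

# Without -r, read treats backslash-newline as a line continuation,
# so the two names are merged into one.
plain=$(printf '%s\n' "$list" | { read line; printf '%s' "$line"; })

# With -r (and an empty IFS, which keeps leading and trailing blanks),
# the first name arrives intact.
raw=$(printf '%s\n' "$list" | { IFS= read -r line; printf '%s' "$line"; })

printf 'read:    %s\n' "$plain"
printf 'read -r: %s\n' "$raw"
```

The first line of output shows the merged name “dirnext.txt”; the second shows “dir\” as intended.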

Doing it correctly, the very hard correct way

So, how can you process filenames correctly in shell? Here’s a quick summary about how to do it correctly, for the impatient who “just want the answer”. In short: Double-quote to use “$variable” instead of $variable, set IFS to just newline and tab, prefix all globs/filenames so they cannot begin with “-” when expanded, and use one of a few templates that work correctly. Here are some of those templates that work correctly:

IFS="$(printf '\n\t')" # Remove 'space', so filenames with spaces work well.

# Correct glob use: always use "for" loop, prefix glob, check for existence:
for file in ./* ; do # Use "./*", NEVER bare "*"
  if [ -e "$file" ] ; then # Make sure it isn't an empty match
    COMMAND ... "$file" ...
  fi
done


What is IFS?

IFS (the “internal field separator”, sometimes called the “input field separator”) is an ancient, very standard, but not well-known capability of Bourne shells. After almost all substitutions, including command substitution `…` and variable substitution ${…}, the characters in IFS are used to split up any substitution results into multiple values (unless the results are inside double-quotes). Normally, IFS is set to space, tab, and newline, which means that by default, after almost all substitutions, spaces are interpreted as separating the substituted values into different values. This default IFS setting is very bad if file lists are produced through substitutions like command substitution and variable substitution, because filenames with spaces will get split into multiple filenames at the spaces (oops!). And processing filenames is really common.
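The effect of IFS is easy to observe directly. The snippet below is a small demonstration added for illustration; the helper `count_args` and the example filename are made up.

```shell
#!/bin/sh
# count_args: print how many arguments it received (hypothetical helper).
count_args() { printf '%s\n' "$#"; }

name='my summer photo.jpg'

# Default behaviour (space, tab, newline): the unquoted expansion is split
# at each space. An unset IFS behaves like the default value.
unset IFS
default_count=$(count_args $name)

# IFS limited to newline and tab: the same unquoted expansion stays whole.
IFS=$(printf '\n\t')
narrow_count=$(count_args $name)

# Double quotes suppress splitting regardless of IFS.
quoted_count=$(count_args "$name")

printf 'default IFS: %s\nnewline+tab IFS: %s\nquoted: %s\n' \
  "$default_count" "$narrow_count" "$quoted_count"
```

With the default IFS the single filename is counted as 3 arguments; in the other two cases it is counted as 1.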

# Correct glob use, but requires nonstandard bash extension:
shopt -s nullglob # Bash extension, so that empty glob matches will work
for file in ./* ; do # Use "./*", NEVER bare "*"
  COMMAND ... "$file" ...
done

# These handle all filenames correctly; can be unwieldy if COMMAND is large:
find ... -exec COMMAND... {} \;
find ... -exec COMMAND... {} \+ # If multiple files are okay for COMMAND

# This skips filenames with control characters (inc. tab and newline);
IFS="$(printf '\n\t')"
controlchars="$(printf '*[\001-\037\177]*')"
for file in $(find . ! -name "$controlchars") ; do
  COMMAND "$file" ...
done

# Okay if filenames can't contain tabs or newlines; beware the assumption:
IFS="$(printf '\n\t')"
for file in $(find .) ; do
  COMMAND "$file" ...
done

# Requires nonstandard but common extensions in find and xargs:
find . -print0 | xargs -0 COMMAND

# Requires nonstandard extensions to find and to shell (bash works);
# variables might not stay set once the loop ends:
find . -print0 | while IFS="" read -r -d "" file ; do ...
  COMMAND "$file" # Use quoted "$file", not $file, everywhere.
done

# Requires nonstandard extensions to find and to shell (bash works);
# underlying system must inc. named pipes (FIFOs) or the /dev/fd mechanism.
# In this version, variables, *do* stay set after the loop ends, and
# you can read from stdin (change the 4s to another number if fd 4 is needed):
while IFS="" read -r -d "" file <&4 ; do
  COMMAND "$file" # Use quoted "$file", not $file, everywhere.
done 4< <(find . -print0)

# Named pipe version.
# Requires nonstandard extensions to find and to shell's read (bash works);
# underlying system must inc. named pipes (FIFOs). Again,
# in this version, variables, *do* stay set after the loop ends, and
# you can read from stdin (change the 4s to something else if fd 4 needed).
mkfifo mypipe
find . -print0 > mypipe &
while IFS="" read -r -d "" file <&4 ; do
  COMMAND "$file" # Use quoted "$file", not $file, everywhere.
done 4< mypipe

# Requires author's nul2pfb program. This uses "find . -print0"; for
# POSIX 2008 compliance, replace that with: find . -exec printf '%s\0' {} \;
for encoded_filename in $(find . -print0 | nul2pfb) ; do
  filename="$(printf "%bX" "$encoded_filename")" ; filename="${filename%X}"
  # Use "$filename" from here on...
done

You might also want to put "set -eu" at the beginning of your scripts; it does nothing for filenames, but it can help detect other script errors.

Prefix all globs/filenames

A "glob" is a pattern for filename matching like "*.pdf". Whenever you use globbing to select files, never begin with a globbing character (typically the characters "*", "?", or "[") or with a value that might begin with "-". If you’re starting from the current directory, prefix the glob with "./". In short, use:

cat ./* # Use this, NOT cat * ... Must have 1+ files.
for file in ./* ; do # Use this, NOT "for file in *" (beware empty lists)
  ...
done

Remember that globbing normally skips hidden files (those beginning with ".").

Beware of globbing if there might be no matches. By default, if a glob like ./*.pdf matches no files, then the original glob pattern will be returned instead. This is almost never what you want, e.g., in a "for" loop this will cause the loop to execute once, but with the pattern instead of a filename! You can use globbing in a for loop, even if it might not match anything, using one of two approaches. One approach, which is completely portable, is to re-test for the existence of the file before using it in the loop:

for file in ./* ; do # Use this, NOT "for file in *"
  if [ -e "$file" ] ; then # Make sure it exists and isn't an empty match
    COMMAND ... "$file" ...
  fi
done

A more efficient but nonstandard solution for empty matches is to use a nonstandard shell extension called "null globbing". Null globbing fixes this by replacing an unmatched pattern with nothing at all. In bash you can enable nullglob with "shopt -s nullglob". In zsh, you can use setopt NULL_GLOB for the same result. Then this will work correctly:

shopt -s nullglob # Bash extension, so that empty glob matches will work
for file in ./* ; do # Use this, NOT "for file in *"
  COMMAND ... "$file" ...
done

If the match might be empty, you should normally not use globbing as part of a command. Thus, use "cat *.pdf" only if you know there's at least one .pdf file. One exception: If you enable null globbing, and if the command does nothing when handed an empty list of files, then things will be fine. But this condition is often untrue, and in any case, if there are too many matches it will also fail. In short, in robust scripts, globbing should normally be used only as a "for" loop's list.

Bash 4, at least, can get stuck in infinite loops if there are links. In many cases, find is currently the better approach for reliably doing recursive descent into directories.

If you begin your shell script with IFS="$(printf '\n\t')" (as recommended above), and filenames cannot include tab or newline, then you can use find in the "normal" way, either inside `...` or with a normal "for" loop:

# CORRECT if filenames can't include tab/newline *and* if IFS omits space:
COMMAND $(find .)
# OR:
for file in $(find .) ; do
  COMMAND "$file" ...
done

This is a simple and clear solution, and it can handle filenames with spaces, leading dashes, shell metacharacters, and so on. In short, this is the best and clearest solution for non-trivial processing as long as filenames cannot include tab or newline. Below we discuss how to ignore filenames with control characters (like tab or newline); if you add that, then this correctly handles or ignores all filenames. Note that `...` cannot handle lists that are too large; if that might be a problem, use the for loop instead.

Similarly, as long as filenames can’t include tab or newline, you can store filenames in files with one record per newline-separated line, and tabs can separate the fields. This format is well supported by tools like cut, join, and paste. I think it’d be best if POSIX systems simply forbade filenames from including control characters like tab and newline; many programs assume it anyway, and many filesystems require it. Then these constructs would just work, no matter what.

# Here’s another solution as long as filenames cannot include newline:
find . | while IFS="" read -r file ; do ...
  COMMAND "$file" # Use "$file" not $file everywhere.
done

You could pipe filenames through another command like sed to do character substitutions, like this:

find . | sed -e 's/[^A-Za-z0-9]/\\&/g' | xargs -E "" COMMAND

This is complicated, hard to read, ridiculously inefficient, and isn't better than many other alternatives (e.g., it doesn't handle newlines either). Don’t do this; instead, use one of the better ways described in this paper.

Could the POSIX standard be changed to make file processing easier?

The POSIX standard could (and should!) be modified to make it easier to handle the outrageously permissive filenames that are permitted today. Basically, we need extensions to make globbing and find easier to use.

# This works if filenames never begin with "-" and nullglob is enabled:
for file in *.pdf ; do ... done # Use "$file" not $file

# This works if filenames have no control chars and IFS is tab and newline:
for file in $(find .) ; do ... done # Use "$file" not $file

- David Wheeler
