I saw the pattern [^\x20-\x7E] used in a regular expression whose goal was to remove non-ASCII characters from a string. What does it mean?

It reads roughly as: any character that is not (^) in the range \x20-\x7E (hex 0x20 to 0x7E).

According to http://www.asciitable.com/, those are the characters from space to ~.

It's good to note that you can also use APIs to do the check. For instance, in Java you can use java.lang.Character.isISOControl(character) and similar methods that make your code more readable. – Stan Nov 14, 2016 at 9:56

There are languages that do not have an API for such purposes. In PHP you have to do something like this to extract special characters: preg_replace_callback('/[^\x20-\x7f]/', function($match) { return DO_SOMETHING_WITH_SPECIAL_CHARS($match[0]); }, $url); – Hermann Schwarz Nov 26, 2019 at 19:50

It matches any single character that is not an ASCII printing character.

Printing characters include a to z, A to Z, 0 to 9 and symbols such as ",;$#% etc.

^ not
\x20 hex code for space character
\x7e hex code for ~ (tilde) character

All the ASCII printing characters, including the space itself, fall between these two.
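As a quick check of that claim, here is a minimal Python sketch (Python is my choice for illustration, not something the question specifies). It compares the code points 0x20 through 0x7E against the letters, digits, punctuation and space listed by the standard `string` module. The variable names are made up for the example.

```python
import string

# Every character with a code point from 0x20 (space) to 0x7E (tilde), inclusive.
in_range = {chr(code) for code in range(0x20, 0x7E + 1)}

# The ASCII printing set assembled from the standard library's constants.
printable_ascii = set(
    string.ascii_letters + string.digits + string.punctuation + " "
)

print(len(in_range))                # 95
print(in_range == printable_ascii)  # True
```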

This pattern therefore matches non-ASCII characters as well as ASCII control (non-printing) characters such as bell, tab, null and others.
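To make that concrete, here is a small sketch using Python's `re` module, which accepts the same `\xNN` escapes inside a character class. The sample string and the constant name are assumptions chosen for the demonstration.

```python
import re

# Matches one character outside the ASCII printing range 0x20-0x7E.
NON_PRINTING = re.compile(r"[^\x20-\x7E]")

# Sample text containing an accented letter, a tab and a BEL control character.
sample = "caf\u00e9\tbell\x07 ok~"

print(NON_PRINTING.findall(sample))  # ['é', '\t', '\x07']
print(NON_PRINTING.sub("", sample))  # cafbell ok~
```

Note that the accented letter is removed along with the control characters, which may or may not be what you want.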

Look at

man ascii

on a unix system to see which characters it matches.

In perl, you could also write this as

[^ -~]
[[:^cntrl:]]

This last one is slightly different, in that it matches any non-control character, including extended ASCII (e.g. accented characters) and Unicode.

You may not want to restrict yourself to just ASCII, since non-US locales often use valid printing characters outside this small range, e.g. øüéåç...
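If the aim is to drop only control characters while keeping accented letters, one possible approach is to test each character's Unicode general category instead of its ASCII range. The sketch below is in Python. `strip_control_chars` is a hypothetical helper name, and treating category `Cc` as "control" is an assumption that approximates, but is not guaranteed to equal, a POSIX `[:cntrl:]` class in every regex flavor and locale.

```python
import unicodedata


def strip_control_chars(text: str) -> str:
    """Drop characters in Unicode category Cc (control); keep everything else."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cc")


# Accented letters survive; the tab and the NUL are removed.
sample = "\u00f8\u00fc\u00e9\u00e5\u00e7\tname\x00"
print(strip_control_chars(sample))  # øüéåçname
```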

I think you meant [^[:print:]] for that last one. POSIX character class notation includes the square brackets as well as the colons, and the whole thing has to be placed inside another set of square brackets. (And of course, [:cntrl:] is the wrong class.) However, POSIX classes are also supposed to be locale-sensitive, which means they could match, e.g., accented letters as well as the basic ASCII set. – Alan Moore Jun 12, 2009 at 4:37

Ah yes, that was sloppy (it was late). cntrl is indeed different from the previous ones, in the sense that it will include printing characters in the extended ASCII and even Unicode ranges, but I believe it's likely that's what was intended. – Alex Brown Jun 12, 2009 at 6:52

I would advise leaving the POSIX character classes alone, especially in a case like this, where we don't know which regex flavor is being used, which OS it's running on, or in which locale. All of those factors can change their behavior. – Alan Moore Jun 12, 2009 at 11:48

This answer is the most complete: [^\x20-\x7E] is not restricted to 'not printing characters'. It also matches non-ASCII characters like 'ä', 'ö', et cetera. I think [^\x20-\x7E] is often used to do something with such special characters (umlauts). – Hermann Schwarz Nov 26, 2019 at 20:02
