We came across a strange bug today when exporting some data from a database into Excel. When we viewed the text through our web interface, it showed as the equivalent of “The quick brown fox jumped over the lazy dog.” However, when we exported the data to excel, it showed up as “The quick brown foxÂjumped over the lazy dog.”
What was that funny character? It wasn’t visible in the text field in our web interface.
We traced the source of the data back to that most dreaded text formatting tool – Microsoft Word. Word always causes me headaches with its constant need to fix everything, turning my normal quotes (“) into curly quotes (“”) or hypens (-) into m-dashes (—), so I immediately suspected it was the source of the problem.
I turned on “Show all printing characters”, and sure enough, there was a strange mark at that spot:
But what was it? I am used to the small dot to indicate a space, but this looked more like a tilde character:
I had never seen it before. Without the non-printing characters on the screen, it looked just like a space. But clearly Word understood it to be something different.
I checked a variety of lists of non-printable characters Word uses, but it never seemed to appear. I then googled the “Â” and came across a Stack Overflow discussion amongst some engineers about what it was.
It turns out that is a non-breaking space. Web developers are very familiar with it as the “ ” string, which is used to force the web browser to render a space. In HTML, all spaces are considering non-essential, so if you have five in a row, the browser will just render a single space. You use a non-breaking space when you want several spaces to really be rendered.
Microsoft Word supports them, but for their original purpose – non-breaking spaces. The idea of a non-breaking space is that it separates two words but does not allow the text layout engine to choose it as a place for a line break. For example, if you wanted to make sure that “New York” was treated as a single phrase and did not wrap text in the middle of it (with one line ending on “New” and the next one starting on “York”), you could put a non-breaking space in there to stop it.
It seems that somehow a non-breaking space was introduced into our document. It doesn’t appear to have been Word making changes automatically like it does with quotation marks , since it is only at this particular place. How did they get in there?
Then, it dawned on me. Perhaps the problem was before Word. Where had I seen non-breaking spaces before? They are very common in HTML. As an experiment, I created a simple web page that had a single line of text with an “ ” at one spot instead of a space. I then opened it in a web browser, copied the text, and pasted it into Word. I then turned on “Show all nonprinting characters”, and there was the squiggle. The non-breaking space had been copied into Word.
So, this bad character had been passed through generations of systems from an HTML page to Word to our web application to our database to the generated report to excel. It was like a genetic defect, passed down hidden through generations until it finally manifested.
So, the moral of the story: be very careful when copying from HTML pages and pasting them into Word.