Ascii vs. Binary Files
Ascii vs. Binary Files
Introduction
Most people classify files in two categories: binary files and ASCII (text) files. You've actually worked with both. Any program you write (C/C++/Perl/HTML) is almost surely an ASCII file.
An ASCII file is defined as a file that consists of ASCII characters. It's usually created by using a text editor like emacs, pico, vi, Notepad, etc. There are fancier editors out there for writing code, but they may not always save it as ASCII.
As an aside, ASCII text files seem very "American-centric". After all, the 'A' in ASCII stands for American. However, the US does seem to dominate the software market, and so effectively, it's an international standard.
Computer science is all about creating good abstractions. Sometimes it succeeds and sometimes it doesn't. Good abstractions are all about presenting a view of the world that the user can use. One of the most successful abstractions is the text editor.
When you're writing a program, and typing in comments, it's hard to imagine that this information is not being stored as characters. Of course, if someone really said "Come on, you don't really think those characters are saved as characters, do you? Don't you know about the ASCII code?", then you'd grudgingly agree that ASCII/text files are really stored as 0's and 1's.
But it's tough to think that way. ASCII files are really stored as 1's and 0's. But what does it mean to say that it's stored as 1's and 0's? Files are stored on disks, and disks have some way to represent 1's and 0's. We merely call them 1's and 0's because that's also an abstraction. Whatever way is used to store the 0's and 1's on a disk, we don't care, provided we can think of them that way.
In effect, ASCII files are basically binary files, because they store binary numbers. That is, ASCII files store 0's and 1's.
The Difference between ASCII and Binary Files?
An ASCII file is a binary file that stores ASCII codes. Recall that an ASCII code is a 7-bit code stored in a byte. To be more specific, there are 128 different ASCII codes, which means that only 7 bits are needed to represent an ASCII character.
However, since the minimum workable size is 1 byte, those 7 bits are the low 7 bits of any byte. The most significant bit is 0. That means, in any ASCII file, you're wasting 1/8 of the bits. In particular, the most significant bit of each byte is not being used.
Although ASCII files are binary files, some people treat them as different kinds of files. I like to think of ASCII files as special kinds of binary files. They're binary files where each byte is written in ASCII code.
A full, general binary file has no such restrictions. Any of the 256 bit patterns can be used in any byte of a binary file.
We work with binary files all the time. Executables, object files, image files, sound files, and many file formats are binary files. What makes them binary is merely the fact that each byte of a binary file can be one of 256 bit patterns. They're not restricted to the ASCII codes.
Example of ASCII files
Suppose you're editing a text file with a text editor. Because you're using a text editor, you're pretty much editing an ASCII file. In this brand new file, you type in "cat". That is, the letters 'c', then 'a', then 't'. Then, you save the file and quit.
What happens? For the time being, we won't worry about the mechanism of what it means to open a file, modify it, and close it. Instead, we're concerned with the ASCII encoding.
If you look up an ASCII table, you will discover the ASCII code for 0x63, 0x61, 0x74 (the 0x merely indicates the values are in hexadecimal, instead of decimal/base 10).
Here's how it looks:
| ASCII | 'c' | 'a' | 't' |
| Hex | 63 | 61 | 74 |
| Binary | 0110 0011 | 0110 0001 | 0111 1000 |
Each time you type in an ASCII character and save it, an entire byte is written which corresponds to that character. This includes punctuations, spaces, and so forth. I recall one time a student has used 100 asterisks in his comments, and these asterisks appeared everywhere. Each asterisk used up one byte on the file. We saved thousands of bytes from his files by removing comments, mostly the asterisks, which made the file look nice, but didn't add to the clarity.
Thus, when you type a 'c', it's being saved as 0110 0011 to a file.
Now sometimes a text editor throws in characters you may not expect. For example, some editors "insist" that each line end with a newline character.
What does that mean? I was once asked by a student, what happens if the end of line does not have a newline character. This student thought that files were saved as two-dimensions (whether the student realized ir or not). He didn't know that it was saved as a one dimensional array. He didn't realize that the newline character defines the end of line. Without that newline character, you haven't reached the end of line.
The only place a file can be missing a newline at the end of the line is the very last line. Some editors allow the very last line to end in something besides a newline character. Some editors add a newline at the end of every file.
Unfortunately, even the newline character is not that universally standard. It's common to use newline characters on UNIX files, but in Windows, it's common to use two characters to end each line (carriage return, newline, which is \r and \n, I believe). Why two characters when only one is necessary?
This dates back to printers. In the old days, the time it took for a printer to return back to the beginning of a line was equal to the time it took to type two characters. So, two characters were placed in the file to give the printer time to move the printer ball back to the beginning of the line.
This fact isn't all that important. It's mostly trivia. The reason I bring it up is just in case you've wondered why transferring files to UNIX from Windows sometimes generates funny characters.
Editing Binary Files
Now that you know that each character typed in an ASCII file corresponds to one byte in a file, you might understand why it's difficult to edit a binary file.
If you want to edit a binary file, you really would like to edit individual bits. For example, suppose you want to write the binary pattern 1100 0011. How would you do this?
You might be naive, and type in the following in a file:
11000011 |
But you should know, by now, that this is not editing individual bits of a file. If you type in '1' and '0', you are really entering in 0x49 and 0x48. That is, you're entering in 0100 1001and 0100 1000 into the files. You're actually (indirectly) typing 8 bits at a time.
"But, how am I suppose to edit binary files?", you exclaim! Sometimes I see this dilemma. Students are told to perform a task. They try to do the task, and even though their solution makes no sense at all, they still do it. If asked to think about whether this solution really works, they might eventually reason that it's wrong, but then they'd ask "But how do I edit a binary file? How do I edit the individual bits?"
The answer is not simple. There are some programs that allow you type in 49, and it translates this to a single byte, 0100 1001, instead of the ASCII code for '4' and '9'. You can call these programs hex editors. Unfortunately, these may not be so readily available. It's not too hard to write a program that reads in an ASCII file that looks like hex pairs, but then converts it to a true binary file with the corresponding bit patterns.
That is, it takes a file that looks like:
63 a0 de |
and converts this ASCII file to a binary file that begins 0110 0011 (which is 63 in binary). Notice that this file is ASCII, which means what's really stored is the ASCII code for '6', '3', ' ' (space), 'a', '0', and so forth. A program can read this ASCII file then generate the appropriate binary code and write that to a file.
Thus, the ASCII file might contain 8 bytes (6 for the characters, 2 for the spaces), and the output binary file would contain 3 bytes, one byte per hex pair.
Viewing Binary Files
Most operating systems come with some program that allows you to view a file in "binary" format. However, reading 0's and 1's can be cumbersome, so they usually translate to hexadecimal. There are programs called hexdump which come with the Linux distribution or xxd.
While most people prefer to view files through a text editor, you can only conveniently view ASCII files this way. Most text editors will let you look at a binary file (such as an executable), but insert in things that look like ^@ to indicate control characters.
A good hexdump will attempt to translate the hex pairs to printable ASCII if it can. This is interesting because you discover that in, say, executables, many parts of the file are still written in ASCII. So this is a very useful feature to have.
Writing Binary Files, Part 2
Why do people use binary files anyway? One reason is compactness. For example, suppose you wanted to write the number 100000. If you type it in ASCII, this would take 6 characters (which is 6 bytes). However, if you represent it as unsigned binary, you can write it out using 4 bytes.
ASCII is convenient, because it tends to be human-readable, but it can use up a lot of space. You can represent information more compactly by using binary files.
For example, one thing you can do is to save an object to a file. This is a kind of serialization. To dump it to a file, you use a write() method. Usually, you pass in a pointer to the object and the number of bytes used to represent the object (use the sizeof operator to determine this) to the write() method. The method then dumps out the bytes as it appears in memory into a file.
You can then recover the information from the file and place it into the object by using a corresponding read() method which typically takes a pointer to an object (and it should point to an object that has memory allocated, whether it be statically or dynamically allocated) and the number of bytes for the object, and copies the bytes from the file into the object.
Of course, you must be careful. If you use two different compilers, or transfer the file from one kind of machine to another, this process may not work. In particular, the object may be laid out differently. This can be as simple as endianness, or there may be issues with padding.
This way of saving objects to a file is nice and simple, but it may not be all that portable. Furthermore, it does the equivalent of a shallow copy. If your object contains pointers, it will write out the addresses to the file. Those addresses are likely to be totally meaningless. Addresses may make sense at the time a program is running, but if you quit and restart, those addresses may change.
This is why some people invent their own format for storing objects: to increase portability.
But if you know you aren't storing objects that contain pointers, and you are reading the file in on the same kind of computer system you wrote it on, and you're using the same compiler, it should work.
This is one reason people sometimes prefer to write out ints, chars, etc. instead of entire objects. They tend to be somewhat more portable.
Summary
An ASCII file is a binary file that consists of ASCII characters. ASCII characters are 7-bit encodings stored in a byte. Thus, each byte of an ASCII file has its most significant bit set to 0. Think of an ASCII file as a special kind of binary file.
A generic binary file uses all 8-bits. Each byte of a binary file can have the full 256 bitstring patterns (as opposed to an ASCII file which only has 128 bitstring patterns).
There may be a time where Unicode text files becomes more prevalent. But for now, ASCII files are the standard format for text files.
Ascii vs. Binary Files的更多相关文章
- How can I read binary files from Resources
How can I read binary files from Resourceshttp://answers.unity3d.com/questions/8187/how-can-i-read-b ...
- text files and binary files
https://en.wikipedia.org/wiki/Text_file https://zh.wikipedia.org/wiki/文本文件
- mysql(或者mariadb)连接工具HeidiSQL
Some infos around HeidiSQL Project website: http://www.heidisql.com/Google Code: http://code.google. ...
- Linux学习笔记:ftp中binary二进制与ascii传输模式的区别
在使用ftp传输文件时,常添加上一句: binary -- 使用二进制模式传输文件 遂查资料,如下所获. FTP可用多种格式传输文件,通常由系统决定,大多数Linux/UNIX系统只有两种模式:文本 ...
- Linux——grep binary file
原创声明:本文系博主原创文章,转载或引用请注明出处. grep命令是linux下常用的文本查找命令.当grep检索的文件是二进制文件时,grep命令会提示: $grep pattern filenam ...
- Debugging Information in Separate Files
[Debugging Information in Separate Files] gdb allows you to put a program's debugging information in ...
- ftp二进制与ascii传输方式区别
ASCII 和BINARY模式区别: 用HTML 和文本编写的文件必须用ASCII模式上传,用BINARY模式上传会破坏文件,导致文件执行出错. BINARY模式用来传送可执行文件,压缩文 ...
- rpmbuild spec 打包jar变小了、设置禁止压缩二进制文件Disable Binary stripping in rpmbuild
Disable Binary stripping in rpmbuild 摘自:http://livecipher.blogspot.com/2012/06/disable-binary-stripp ...
- reading/writing files in Python
file types: plaintext files, such as .txt .py Binary files, such as .docx, .pdf, iamges, spreadsheet ...
随机推荐
- FAST Hello World - Preparation for software's running environment
Ubuntu 14.04 安装 libpcap-1.1.1 & libpnet-1.1.4 & NMAC function lib 参考: NetMagic.org yacc: com ...
- NYOJ 16 矩形嵌套(经典DP)
http://acm.nyist.net/JudgeOnline/problem.php?pid=16 矩形嵌套 时间限制:3000 ms | 内存限制:65535 KB 难度: ...
- Java 文件夹递归遍历
import java.io.File; public class Demo1 { public static void main(String[] args) { File dir=new File ...
- python 获取文件的修改时间
os.path.getmtime(name) #获取文件的修改时间 os.stat(path).st_mtime#获取文件的修改时间 os.stat(path).st_ctime #获取文件修改时间 ...
- python 加密与解密
加密算法分类 对称加密算法: 对称加密采用了对称密码编码技术,它的特点是文件加密和解密使用相同的密钥 发送方和接收方需要持有同一把密钥,发送消息和接收消息均使用该密钥. 相对于非对称加密,对称加密具有 ...
- java基础学习总结——线程(一)
一.线程的基本概念
- 查找SQL Server 自增ID值不连续记录
在很多的时候,我们会在数据库的表中设置一个字段:ID,这个ID是一个IDENTITY,也就是说这是一个自增ID.当并发量很大并且这个字段不是主键的时候,就有可能会让这个值重复:或者在某些情况(例如插入 ...
- 关于angular5的惰性加载报错问题
之前为了测试一个模块优化问题,于是用angular-cli快速搭建了个ng5的脚手架demo,在应用惰性加载功能的时候发现浏览器报错如下: ERROR Error: Uncaught (in prom ...
- TimeZone 时区 (JS .NET JSON MYSQL) + work week 闰年
来源参考 : http://www.cnblogs.com/qiuyi21/archive/2008/03/04/1089456.html 来源参考 : http://walkingice.blogs ...
- C++ 多态性和虚函数
2017-06-27 19:17:52 C++面向对象编程的一个重要的特性就是多态性,而多态性的实现需要依赖虚函数的帮助. 一.多态的作用: 隐藏实现细节,使得代码能够模块化: 接口重用,实现“一个接 ...