Chapter 3 Data, Information, and Files

There are 2 definitions for information and data:

Data: The words, numbers, and graphics that are entered into the computer to describe people, events, and things.
Information: The words, numbers, and graphics that are the basis for making decisions.

Often information is the result derived by processing data, which gives the second definition:

Data: The words, numbers, and graphics entered by a user into a program.
Information: The words, numbers, and graphics displayed or printed by a program.

File: a named collection of program instructions or data that exists on a storage medium such as a disk. An example would be the words that make up a course syllabus, or the words in a letter to your advisor.

An executable file contains the instructions that tell a computer how to perform a specific task. Examples are a word processing program and an operating system. To use an executable file, you run it. In DOS, an executable file ends with the 3-letter extension .EXE. To run it, you type the file name after the command-line prompt. In Windows, executable files have specially shaped icons. You can run one by double-clicking on its icon. An executable file consists of a list of binary instructions for the computer (machine language instructions - sequences of 0's and 1's). Humans cannot read an executable file.

A data file contains words, numbers, or graphics that you can view, edit, save, print, or send electronically to another computer. You can create a data file using a word processor, spreadsheet software, graphics software, or a database program. You can save a data file on disk, retrieve it later, view it, modify it, and save it again. You usually need to view or update a data file using the same software package that was used to create it. For example, a Microsoft Word data file can be opened and processed with Microsoft Word. Another word-processing program can probably open it too, but it would have to convert the file before it could process it.
You probably cannot view it using a spreadsheet or database program.

Source files: A third category is the source file. A source file contains instructions written in a programming language (not machine language) that must be translated by the computer into machine language before they can be executed. Source files are readable by humans and can normally be viewed with a word processor. If you write BASIC or COBOL programs, you will save them as source files. The file AUTOEXEC.BAT is an example of a source file - it contains instructions that help to customize your computer. It is automatically translated and executed whenever your computer is booted.

The point of all this: You process data files, executable files, and source files differently; therefore, you (or the computer) must know
• the category of each file on your disk.
• If it is a data file, the software package that created it.
• If it is a source file, the language compiler for translating it.

In the document-centric approach (Windows 95, Mac OS) you select the file (by clicking on it) and the operating system selects and runs the appropriate application or system program to process it. In older operating systems, you load the application program first and then try to "open" or retrieve the file with it.

To enable the computer to practice the document-centric approach, or to help you determine the kind of file, you should know and follow the system's file-naming conventions. A filename in DOS or Windows 3.1 consists of 8 or fewer characters - letters, digits, special symbols, but no spaces. It is followed by an extension of up to 3 characters, which is separated from the filename by a period. The extension often identifies the file category:

.EXE or .COM - executable files you can load and run
.SYS, .DRV, .DLL - executable files you cannot run

Usually you will not create or name executable files. You will name Microsoft Word files. When you do, use the extension .Doc.
You don't need to specify the extension; Microsoft Word will automatically add .Doc to identify the file as a Word file. For text files - files of regular characters but no special word-processing characters - use the extension .Txt. Most word processors can open text files as well as their own document files.

Legal names      Illegal names       Reason
Memo_1.Doc       Englishpaper.Doc    too long
CIS10lab.Doc     My paper.Doc        contains a space
35words.Doc      Eng/pap.Text        slash and 4-character extension

Use of Wildcards: Wildcards enable you to list a collection of files.
If you specify the file name my_text.Doc you will see a list with just this one file.
If you type my_text.* you will see a list of all files with the filename my_text and any extension.
If you type *.Doc you will see a list of all files with extension .Doc.
If you type *.* you will see a list of all your files.

How does the computer system organize the files on your disk? There may be hundreds of files and several different storage devices with files - how do you and the computer keep all this straight?

Each storage device has a unique name: A: and B: for floppy drives, C: for the primary hard drive, and letters D: through Z: for other drives. The operating system maintains a list of files, called a directory, for each storage device. The directory listing for a file shows the filename, its extension, its size, the date and time it was last saved, and (in Windows) an icon for it:

my_paper.Doc   5028   10/31/95   9:53pm

The main directory, or root directory, usually contains several files and many folders, or subdirectories, which store additional files and help you organize the data on a disk. You create your own subdirectories. A subdirectory can have its own subdirectories. The complete file specification (used to access a file) consists of its pathname followed by its file name.
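The naming rules and the pathname/filename split can be sketched in Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the pattern is a simplified version of the 8.3 rule (letters, digits, and underscore only, whereas DOS accepts a few other special symbols). The sample specification C:\School\CIS10\Chap1not.Doc is the one used in this chapter.

```python
import re
from pathlib import PureWindowsPath

# Simplified 8.3 rule (an assumption for illustration): 1-8 letters, digits,
# or underscores, optionally followed by a period and a 1-3 character extension.
DOS_NAME = re.compile(r"[A-Za-z0-9_]{1,8}(\.[A-Za-z0-9]{1,3})?")

def is_legal_dos_name(name):
    """Return True if the whole name fits the simplified 8.3 pattern."""
    return DOS_NAME.fullmatch(name) is not None

def split_specification(spec):
    """Split a file specification into (pathname, filename, extension)."""
    path = PureWindowsPath(spec)
    return str(path.parent), path.stem, path.suffix

for name in ["Memo_1.Doc", "CIS10lab.Doc", "35words.Doc",
             "Englishpaper.Doc", "My paper.Doc", "Eng/pap.Text"]:
    print(name, "legal" if is_legal_dos_name(name) else "illegal")

print(split_specification(r"C:\School\CIS10\Chap1not.Doc"))
```

PureWindowsPath is used so that the backslash-separated path is parsed the same way on any operating system.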
C:\
   School
      CIS10
         Lab_1.Doc
         Chap1not.Doc
      Acct10
   Business
   Home

Specification for file Chap1not.Doc:

   C:\School\CIS10\     Chap1not.Doc
   pathname             filename.extension

In Windows 3.1, you use the File Manager to see all files in a particular directory or subdirectory. The screen is split into a left and a right half. With folder C: selected:

   Left half                      Right half
   folder C:\                     School
      folder School               Business
         folder CIS10             Home
         folder Acct10
      folder Business
      folder Home

Under folder C:, you would see the 3 subdirectories School, Business, and Home listed on the right and the subdirectory structure listed on the left. If you clicked on School, the folder icon on the left would open, and you would see the 2 subdirectories CIS10 and Acct10 listed on the right:

   folder School:   CIS10   Acct10

If you clicked on CIS10, you would see the 2 files listed on the right:

   folder CIS10:    Lab_1.Doc   Chap1not.Doc

Storage Technologies

Storing data: called writing data or saving a file.
Retrieving data: called reading data, loading data, or opening a file.

We rate storage devices based on their capacity, access time, and transfer rate.

Storage capacity:
1 byte - storage for a single character
1 kilobyte - storage for 1,024 bytes, or about 1,000 characters
1 megabyte - storage for about 1 million bytes
1 gigabyte - storage for about 1,000 megabytes, or 1 billion bytes (10^9 bytes)

Access time is the average time it takes to locate data and read it - on the order of 10 milliseconds (10 thousandths of a second). Compared to the time it takes to access data in main memory (less than 1 millionth of a second), this is a long time.

Transfer rate is the amount of data that can be transferred in a particular time period, for example 5 - 10 megabytes per second.

For large quantities of storage there are 3 kinds of media: CD-ROM, magnetic disks, and tape. Tapes and disks are magnetic media.
On a magnetic medium, particles on the surface are magnetized so that they point in one orientation for a 1 and a different orientation for a 0. The average lifetime of the data is about 3 years, as magnetic media may gradually lose their magnetic charge. It is relatively easy to change the orientation of the particles, so you can easily rewrite the data on a magnetic disk or tape. Tapes are slower than disks. Tapes are used primarily for backup and are read or written from beginning to end (sequential access). Disks are random access devices, which means you can access any byte on the disk at any time.

When writing data onto an optical storage medium, a laser beam is used to burn pits into a reflective surface. During reading, laser beams are pointed at the 8 spots that represent a byte of data. Spots that are pits do not reflect light (read as 0). Spots that are not pits reflect light (read as 1). Data on a CD-ROM does not depend on a magnetic charge, so it lasts far longer than data on magnetic media (although no medium literally lasts forever). A CD-ROM does not have the access speed and transfer rate of a magnetic disk, but it is much faster than tape and does permit random access. It is also cheaper to purchase a CD-ROM than a disk with the same capacity.

Disks are rotated on a spindle until the data you are asking for is under the read/write head, and then the data is transferred. Floppy disks have one plastic sheet covered with magnetic material on both sides, so you can store data on both sides. Formatting creates a series of concentric circles on the disk called tracks. A disk is also divided into wedge-shaped sectors, each of which can store 512 bytes.

Low density 5 1/4" disks - 40 tracks, 9 sectors per track, or a total of 360 sectors per side. With 2 sides, each sector position holds 1 K (2 x 512 bytes), so 360 x 1 K = 360 K bytes capacity.
High density 3.5" disks - 80 tracks, 18 sectors per track, or a total of 1,440 sectors per side. With 2 sides, 1,440 x 1 K = 1,440 K bytes, which is sold as 1.44 megabytes.

Floppy disks are too slow to use for much more than backup and transferring files from one computer to another. Floppy disks were used for distribution of software, but as software packages become larger, CD-ROMs are used instead.
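The floppy capacities quoted above follow from the disk geometry: tracks x sectors per track x sides x 512 bytes per sector. A minimal sketch of that arithmetic, assuming both sides of the disk are used (the function name is hypothetical):

```python
BYTES_PER_SECTOR = 512  # sector size given in the notes

def floppy_capacity(tracks, sectors_per_track, sides=2):
    """Return the formatted capacity in bytes for the given geometry."""
    return tracks * sectors_per_track * sides * BYTES_PER_SECTOR

low_density = floppy_capacity(40, 9)     # 5 1/4" low density disk
high_density = floppy_capacity(80, 18)   # 3.5" high density disk

print(low_density, "bytes =", low_density // 1024, "K")
print(high_density, "bytes =", high_density // 1024, "K")
```

The high density result is 1,440 K, which is where the familiar "1.44 megabyte" label comes from.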
You can protect the data on a floppy disk from being overwritten by moving a plastic tab so that light can shine through the opening and be detected by a photocell inside the disk drive. The read-write heads move over the surface and detect the magnetic charges in all bytes of the selected sector (reading) or magnetize the particles in each byte of the sector (writing).

Removable hard drives are becoming very popular. They consist of a rigid metal platter that can store data on both sides. Because the platter is metal, not plastic, storage capacities are much larger and speeds are close to fixed disk speeds. They are currently available in capacities from 100 megabytes to 1,000 megabytes.

Hard disks consist of a stack of rigid metal platters that rotate as a unit. Each platter is divided into tracks and sectors like a floppy. Storage densities are much higher than on a floppy. There is a read-write head for the top and bottom of each platter, so if a disk has 8 platters, there are 16 surfaces for data storage and 16 read-write heads. The stack of platters rotates continuously. The read-write heads move in or out as a unit. The vertical stack of tracks accessed when the read-write heads are at a particular position is called a cylinder. To access a file, the drive must be sent the cylinder number, sector, and platter surface on which it is stored.

What is a disk cache? A disk cache is an area of memory that is used to hold extra data transferred from disk. When your program requests a sector from a disk, that sector is transferred into memory and, at the same time, one or more adjacent sectors are transferred into the disk cache. When you request another sector from the disk, it is very likely to be a sector adjacent to the last one, so the computer can copy it from the disk cache into main memory in a microsecond (a millionth of a second) instead of having to spend several milliseconds accessing it on the disk.

Physical file storage: Files are actually stored in one or more clusters on the disk.
A cluster is a group of sectors and is the smallest storage unit on disk that the computer can access. A cluster is 2 sectors on the IBM, and the clusters are all numbered. When a file is stored on disk, the OS checks the FAT (File Allocation Table) to find an empty cluster. It then places the number of the cluster that contains the beginning of the file in the directory along with the filename:

Directory
Filename        Starting cluster
resume.txt      3
cis10lab.txt    5
notes.txt       7

The file allocation table shows each cluster's status. There are 4 possibilities:
1. empty (available for storage of a file) - status is 0
2. reserved for a system (not a user) file - status is 1
3. contains the last segment of a file - status is 999
4. points to the next segment of a file (if not the last segment) - status is the number of the cluster that contains the next segment

Example:

Cluster #   Status
1           1
2           1
3           4
4           999
5           6
6           8
7           9
8           999
9           10
10          11
11          999

When a file is removed, the status of its clusters is changed to 0. For example, if we delete cis10lab.txt, the new FAT would be:

Cluster #   Status
1           1
2           1
3           4
4           999
5           0
6           0
7           9
8           0
9           10
10          11
11          999

The data in clusters 5, 6, and 8 is not actually erased, but there is no way to access it except through a special utility program.

It is good when file segments are stored in consecutive clusters. That way the read/write head does not have to move much to access them. After much use, a disk becomes fragmented, which means there are no large groups of contiguous clusters, so the clusters for a large file may be stored all over the disk. Accessing the file's clusters becomes less efficient when a disk is fragmented. Defragmentation utilities can fix this.
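The directory and FAT example above can be traced with a short Python sketch. It uses the simplified status codes from these notes (0, 1, 999), not the on-disk format of a real FAT file system, and the function names are hypothetical.

```python
EMPTY, RESERVED, END_OF_FILE = 0, 1, 999  # status codes used in the notes

# Directory: filename -> starting cluster (from the example above)
directory = {"resume.txt": 3, "cis10lab.txt": 5, "notes.txt": 7}

# FAT: cluster number -> status (from the example above)
fat = {1: 1, 2: 1, 3: 4, 4: 999, 5: 6, 6: 8,
       7: 9, 8: 999, 9: 10, 10: 11, 11: 999}

def cluster_chain(filename):
    """Follow the FAT from the file's starting cluster to its last segment."""
    chain = []
    cluster = directory[filename]
    while True:
        chain.append(cluster)
        status = fat[cluster]
        if status == END_OF_FILE:
            return chain
        cluster = status  # status holds the number of the next cluster

def delete_file(filename):
    """Mark the file's clusters as empty; the data itself is not erased."""
    for cluster in cluster_chain(filename):
        fat[cluster] = EMPTY
    del directory[filename]

for name in directory:
    print(name, cluster_chain(name))

delete_file("cis10lab.txt")
print("empty clusters:", [c for c, s in fat.items() if s == EMPTY])
```

Note that cis10lab.txt occupies clusters 5, 6, and 8, which are not all consecutive, so this small example already shows a fragmented file.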