Linux Systems Programming by RL
20 -System Programming is the art of writing system software.
-System software lives at a low level, interfacing directly with the kernel and core system libraries.
-examples of system software: shell, text editor, compiler etc.
-system software is the heart of all software.
21 -system software has a strong awareness of the hardware and OS.
-Much of system software is written in C and its libraries.
-As opposed to system software, there is application software written in languages like PHP, JavaScript, Java etc.
-Linux is Unix like, not Unix.
-Linux follows its own course, diverging where desired and converging only where practical.
22 -Three cornerstones to Linux system programming:
. system calls or syscalls
. C library
. C compiler
-system calls or syscalls:
are function invocations made from the userspace to kernel to request
some service from the OS.
-eg your text editor uses read() syscall to read a file.
-Linux has far fewer syscalls (in hundreds) than windows (that has thousands).
-some linux syscalls are architecture specific (eg only for alpha or intel)
but most (90% +) are common.
-system calls are denoted by a number, starting with 0.
-for reasons of security and reliability, userspace programs are not allowed to call kernel code directly. They issue a software interrupt (or a dedicated trap instruction, depending on the architecture) to 'request' kernel attention.
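-sketch of the points above (my own example, not from the book): glibc's generic syscall() wrapper lets a program invoke a system call by its number. pid_via_raw_syscall is a made-up name; SYS_getpid is the number glibc exposes for getpid on the current architecture.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Invoke getpid by its syscall number through glibc's generic syscall()
 * wrapper, instead of through the dedicated getpid() wrapper. */
pid_t pid_via_raw_syscall(void)
{
    return (pid_t) syscall(SYS_getpid);
}
```

Both routes end up in the same kernel service, so the result should match what getpid() returns.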
23 -The C library:
aka glibc (or libc), is a set of core services and functions that
facilitate system call invocation.
-The C compiler:
aka gcc, is a set of programs for compiling system and userspace
programs.
-APIs and ABIs:
.Application Programming Interface & Application Binary Interface.
An API defines how software communicates at the source code level; an ABI
defines how it communicates at the binary level.
-APIs and ABIs are important for program portability across platforms.
25 -Linux is not officially POSIX compliant.
-The term 'Posix' was suggested by Richard Stallman of FSF.
-Commands to see the versions of:
. linux - uname -r -ours is 3.6
. gcc - gcc --version -ours is 4.7
. glibc - ldd --version -ours is 2.15
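-the same information can also be read from inside a program. A minimal sketch, assuming glibc (gnu_get_libc_version() is glibc-specific) and a GCC-compatible compiler (__VERSION__ is a compiler-provided macro); print_versions is a made-up name.

```c
#include <gnu/libc-version.h>
#include <stdio.h>
#include <sys/utsname.h>

/* Print the kernel release (what `uname -r` shows), the glibc version
 * and the compiler version string. Returns 0 on success, -1 if uname()
 * fails. */
int print_versions(void)
{
    struct utsname u;

    if (uname(&u) == -1)
        return -1;
    printf("kernel:   %s\n", u.release);
    printf("glibc:    %s\n", gnu_get_libc_version());
#ifdef __VERSION__
    printf("compiler: %s\n", __VERSION__);
#endif
    return 0;
}
```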
28 -Everything in linux is a file. And all files are streams of bytes.
-in order to deal with files they must be opened for read &/or write.
-an open file is referenced by a file descriptor (a small integer, not a pointer)
-file descriptors are shared between kernel and userspace.
29 -The file operations start at a certain position in the file.
-usually this position is zero. And it cannot be negative.
-a single file can be opened multiple times, even by the same process.
-although files are accessed by name, a file is not directly tied to its name.
-rather, a file is referenced by its inode.
-The inode of a file has its accounting info (who, what, when, etc)
-the inode does not have a file's name.
-directories are used to provide the names associated to files.
-userspace programs access files by names.
-so a directory can be thought of as a file holding a mapping of filenames to inodes.
-eg: when a userspace program requests a file (from the kernel),
the kernel opens the directory and searches the name to get its inode.
-when two filenames point to the same inode, they are hard links.
-inode numbers are meaningless outside their own filesystem, so hard links can't span filesystems.
-a symbolic link is a separate file, with its own inode, whose contents are
the pathname of the file it points to.
-in other words, symbolic links have different inodes and can span filesystems.
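-a small experiment to check the inode claims above (my own sketch, not from the book). link_demo and the file names it creates are made up; it assumes the given directory is writable and allows hard links.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Creates a file in `dir`, plus a hard link and a symbolic link to it,
 * then compares inode numbers. Returns 1 if the hard link shares the
 * target's inode and the symlink has an inode of its own, 0 if not,
 * -1 on error. All three names are removed before returning. */
int link_demo(const char *dir)
{
    char target[256], hard[256], soft[256];
    struct stat st, sh, ss;
    int result = -1;
    FILE *f;

    snprintf(target, sizeof target, "%s/link_demo_target", dir);
    snprintf(hard, sizeof hard, "%s/link_demo_hard", dir);
    snprintf(soft, sizeof soft, "%s/link_demo_soft", dir);
    unlink(soft); unlink(hard); unlink(target);   /* leftovers, if any */

    f = fopen(target, "w");
    if (f == NULL)
        return -1;
    fclose(f);

    if (link(target, hard) == 0 && symlink(target, soft) == 0 &&
        stat(target, &st) == 0 && stat(hard, &sh) == 0 &&
        lstat(soft, &ss) == 0)          /* lstat: look at the link itself */
        result = (st.st_ino == sh.st_ino) && (ss.st_ino != st.st_ino);

    unlink(soft); unlink(hard); unlink(target);
    return result;
}
```

stat() on the symlink would follow it and report the target's inode, which is why lstat() is used for that one name.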
-Filesystems:
-a hierarchical arrangement of files (on linux, everything is a file)
-The smallest addressable unit on a block device is the sector. The sector is a physical quality of the device.
-Likewise, the smallest logically addressable unit on a filesystem is the block. The block is an abstraction of the filesystem, not of the physical media on which the file-system resides. A block is usually a power-of-two multiple of the sector size.
-Blocks are generally larger than the sector, but they must be no larger than the page size (the smallest unit addressable by the memory management unit, a hardware component).
-in a nutshell, memory page >= fs block >= device sector.
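-both sizes can be queried from a program. A rough sketch with made-up function names; note that st_blksize is the "preferred I/O block size" reported by stat(), which is often but not necessarily the filesystem's block size.

```c
#include <sys/stat.h>
#include <unistd.h>

/* Page size used by the memory management unit, in bytes. */
long page_size(void)
{
    return sysconf(_SC_PAGESIZE);
}

/* Preferred I/O block size for the filesystem holding `path`, in bytes,
 * or -1 on error. */
long io_block_size(const char *path)
{
    struct stat st;

    if (stat(path, &st) == -1)
        return -1;
    return (long) st.st_blksize;
}
```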
-Processes:
-If files are the most fundamental abstraction in a Unix system, processes are the second most fundamental.
-processes are object code in execution; plus they also have data.
-Processes exist in a virtualized system. From a process's perspective, it is
the only process on the system. It never knows the difference because the
kernel seamlessly preempts and reschedules processes.
-each process has its own independent address space in memory.
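-the independent address space can be observed with fork(): a child's change to a variable does not show up in the parent. My own sketch, not the book's; the function and variable names are made up.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int counter = 1;

/* Forks; the child changes `counter` in its own address space and exits.
 * Returns 1 if the parent's copy is unchanged afterwards, 0 if it
 * changed, -1 on error. */
int address_spaces_are_separate(void)
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0) {              /* child: modify its own copy and leave */
        counter = 99;
        _exit(counter == 99 ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return counter == 1;         /* parent still sees the original value */
}
```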
-Threads:
-threads are the unit of activity within a process, an abstraction responsible
for executing code and maintaining a process's running state.
-while threads of a process share some components of the process (like global variables) they have some independent components like the stack that stores local variables.
-note about directory permissions:
For directories, read permission allows the contents of the directory to be listed, write permission allows new links to be added inside the directory, and execute permission allows the directory to be entered and used in a pathname.
-note about signals:
Signals are a mechanism for one-way asynchronous notifications. A signal may be sent from the kernel to a process, from a process to another process, or from a process to itself. Signals typically alert a process to some event,
e.g. the user pressing ctrl+z (which sends SIGTSTP). Linux has about 30 signals.
-note about errors:
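a return value of 0 (from a syscall) or an exit status of 0 (from a program)
is generally treated as success. syscalls usually signal failure by returning
-1 and setting errno to say why; programs signal failure with a nonzero exit status.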
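-a sketch of the usual error-checking pattern for syscalls: test the return value for -1, then look at errno. try_open is a made-up helper.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Tries to open `path` read-only. Returns 0 on success, or the errno
 * value describing why open() failed. */
int try_open(const char *path)
{
    int fd = open(path, O_RDONLY);

    if (fd == -1)
        return errno;    /* save errno at once; later calls may overwrite it */
    close(fd);
    return 0;
}
```

For a human-readable message, the saved value can be passed to strerror().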
-In Linux, the kernel maintains a per-process list of open files called the
file table. It's indexed by file descriptors, which are non-negative ints.
-The fd serves as a handle to the open file (an index into the table, not a C pointer).
-both userspace and kernel space use the file descriptors or fd.
-by default a child process receives a copy of its parent's file table.
-Each linux process has a max number of files that it can open, usually 1024 by default.
-fd 0=stdin, fd 1=stdout, fd 2=stderr. Each process has these 3.
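-the per-process limit and the way descriptors are handed out can be checked from code. A sketch with made-up function names; the 1024 figure is only a common default, so the code asks the kernel instead of assuming it.

```c
#include <fcntl.h>
#include <sys/resource.h>
#include <unistd.h>

/* Soft limit on the number of open file descriptors for this process,
 * or 0 if the limit cannot be read. */
rlim_t max_open_files(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) == -1)
        return 0;
    return rl.rlim_cur;
}

/* Opens `path` read-only and returns the new fd (-1 on error). The
 * kernel hands out the lowest unused descriptor, so with 0, 1 and 2
 * already in use this is normally 3 or higher. */
int open_lowest_fd(const char *path)
{
    return open(path, O_RDONLY);
}
```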
-file open() system call:
-syntax: int open (const char *name, int flags, mode_t mode);
-eg: fd = open ("/home/kidd/madagascar", O_RDONLY); //fd = file desc.
-file ownership:
-the uid of a file's owner is the effective uid of the process creating the file. gid is usually the effective gid of the process creating the file.
-umask: the umask exists to allow the user to limit the permissions that his programs set on new files.
-file creat() system call: //yes no 'e' in creat()
-syntax: int creat (const char *name, mode_t mode);
-eg: fd = creat (file, 0644); //mode is octal; a plain 644 would be read as decimal
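-how the umask and the mode passed to creat() combine: the file ends up with (mode & ~umask). A sketch; create_with_umask is a made-up helper, and it assumes the directory has no default ACL overriding the umask.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Creates `path` with the requested mode while `mask` is the umask.
 * Returns the permission bits the file actually got, or -1 on error.
 * The file is removed again and the old umask restored. */
int create_with_umask(const char *path, mode_t mode, mode_t mask)
{
    struct stat st;
    mode_t old = umask(mask);    /* umask() returns the previous mask */
    int fd;

    unlink(path);                /* make sure creat() really creates it */
    fd = creat(path, mode);      /* like open(path, O_WRONLY|O_CREAT|O_TRUNC, mode) */
    umask(old);
    if (fd == -1)
        return -1;
    if (fstat(fd, &st) == -1) {
        close(fd);
        return -1;
    }
    close(fd);
    unlink(path);
    return (int) (st.st_mode & 0777);
}
```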
-file read() system call:
-syntax: ssize_t read (int fd, void *buf, size_t len);
-eg: ssize_t nr;
nr = read (fd, &word, sizeof(unsigned long));
-note: size of a datatype depends on the processor.
-eg: unsigned long type is 4 bytes on 32-bit systems, 8 bytes on 64-bit sys.
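-read() may return fewer bytes than asked for, or fail with EINTR when a signal arrives first, so callers usually loop. A minimal sketch; read_all is a made-up helper name.

```c
#include <errno.h>
#include <unistd.h>

/* Reads up to `len` bytes from `fd`, retrying after partial reads and
 * EINTR. Returns the number of bytes read (less than `len` only at
 * EOF) or -1 on error. */
ssize_t read_all(int fd, void *buf, size_t len)
{
    char *p = buf;
    size_t done = 0;

    while (done < len) {
        ssize_t nr = read(fd, p + done, len - done);

        if (nr == 0)                 /* EOF */
            break;
        if (nr == -1) {
            if (errno == EINTR)      /* interrupted before any data: retry */
                continue;
            return -1;
        }
        done += (size_t) nr;
    }
    return (ssize_t) done;
}
```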
-file write() system call:
-syntax: ssize_t write (int fd, const void *buf, size_t count);
-eg: ssize_t nr;
nr = write (fd, buf, strlen (buf));
-writes are usually delayed writes, whereby the kernel accumulates data blocks in
buffers before writing them to disk.
-there are however, times when applications want to control when kernel writes
data to disks.
-And for those situations, Linux provides fsync() and fdatasync() syscalls.
-syntax: int fsync(int fd);
int fdatasync(int fd);
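-a sketch that combines the two ideas: write everything (looping over partial writes), then force it to disk with fsync(). write_and_sync is a made-up helper name.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Writes `len` bytes from `buf` to `path` (created or truncated) and
 * asks the kernel to flush them to disk. Returns 0 on success, -1 on
 * error. */
int write_and_sync(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd == -1)
        return -1;
    while (len > 0) {
        ssize_t nr = write(fd, p, len);

        if (nr == -1) {
            if (errno == EINTR)      /* interrupted: try again */
                continue;
            close(fd);
            return -1;
        }
        p += nr;                     /* partial write: continue after it */
        len -= (size_t) nr;
    }
    if (fsync(fd) == -1) {           /* fdatasync(fd) flushes data but may skip some metadata */
        close(fd);
        return -1;
    }
    return close(fd);
}
```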
-direct IO:
-if an application wishes to perform its own IO, bypassing the OS-provided
mechanisms, it passes the O_DIRECT flag to open(). With this flag, I/O is
initiated directly from userspace buffers to the device, bypassing the page cache. All I/O will be synchronous; operations will not return until completed.
-file close() system call:
-after a program has finished working with a file descriptor, it needs to
unmap the file descriptor from the file using close().
-syntax: int close (int fd);
-return value -1 is error, 0 is success.
-lseek() system call:
-while normally IO in a file occurs linearly, some applications need to go to
a random location in the file. This is possible with the lseek() syscall.
-lseek() does not initiate IO, it just updates the file pointer position.
-syntax: off_t lseek(int fd, off_t pos, int origin);
-eg: off_t ret;
ret = lseek (fd, (off_t) 109, SEEK_SET); //set the file position of fd to 109
-It is possible to seek beyond the end of file (EOF). What happens next
depends on whether the following request is a read or a write.
If it is a read, read() returns EOF (0 bytes).
If it is a write, the data is written at the new position, and the gap between
the old EOF and the new position reads back as zeroes. This zero padding is
called a hole. Holes do not occupy physical disk space, so the total size of
all files on a filesystem can exceed the physical size of the disk.
Manipulating holes does not initiate any physical IO, and they can actually
improve performance.
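-a sketch that creates a hole: seek past the end of an empty file, then write one byte. make_sparse_file is a made-up name; whether the gap really takes no disk space depends on the filesystem, so the allocated block count is only reported, not assumed.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Creates `path`, seeks `gap` bytes past the start of the empty file
 * and writes one byte there. Returns the resulting file size, or -1 on
 * error. If `blocks` is not NULL it receives the number of 512-byte
 * units actually allocated to the file. */
off_t make_sparse_file(const char *path, off_t gap, long long *blocks)
{
    struct stat st;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd == -1)
        return -1;
    if (lseek(fd, gap, SEEK_SET) == (off_t) -1 ||   /* no IO happens here */
        write(fd, "x", 1) != 1 ||
        fstat(fd, &st) == -1) {
        close(fd);
        return -1;
    }
    close(fd);
    if (blocks != NULL)
        *blocks = (long long) st.st_blocks;
    return st.st_size;
}
```

Comparing st_size with st_blocks * 512 shows how much of the file is hole on a given filesystem.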
-In addition, Linux provides variants of read() and write() that each take the file position to read from or write to as a parameter.
-these two are pread() and pwrite().
-eg: ssize_t pread (int fd, void *buf, size_t count, off_t pos);
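-unlike lseek() followed by read(), pread() leaves the file position where it was. A sketch to check that; pread_demo is a made-up name and the file it creates is removed again.

```c
#include <fcntl.h>
#include <unistd.h>

/* Writes "abcdef" to `path`, rewinds to offset 0, then reads 3 bytes at
 * offset 2 with pread() into `out` (which must hold 4 bytes). Returns
 * the file position after the pread(), expected to still be 0, or -1
 * on error. */
off_t pread_demo(const char *path, char out[4])
{
    off_t pos = -1;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);

    if (fd == -1)
        return -1;
    if (write(fd, "abcdef", 6) == 6 &&
        lseek(fd, 0, SEEK_SET) == 0 &&
        pread(fd, out, 3, 2) == 3) {
        out[3] = '\0';
        pos = lseek(fd, 0, SEEK_CUR);   /* query the position without moving it */
    }
    close(fd);
    unlink(path);
    return pos;
}
```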
-Multiplexed IO.
-Complex applications such as GUIs often need multiple file descriptors
to handle multiple IO sources, like the keyboard, the mouse and other processes.
This is where multiplexed IO comes into play.
-Multiplexed IO allows an application to block on multiple file descriptors concurrently and to be notified when any one of them becomes ready to read or
write without blocking.
-Linux provides three multiplexed IO solutions:
.select()
.poll()
.epoll() //event poll
-select()
-eg: int select (int n, fd_set *readfds, fd_set *writefds, fd_set *exceptfds,
struct timeval *timeout);
-as can be seen, select() watches three sets of fds (for reading, writing and
exceptions) at the same time, with an optional timeout.
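-a sketch of select() watching a single fd for readability with a timeout. wait_readable_select is a made-up helper; only the read set is used here, the other two sets are passed as NULL.

```c
#include <sys/select.h>
#include <unistd.h>

/* Waits up to `timeout_ms` milliseconds for `fd` to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error. */
int wait_readable_select(int fd, int timeout_ms)
{
    fd_set readfds;
    struct timeval tv;
    int ready;

    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* the first argument is the highest fd in any set, plus one */
    ready = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (ready == -1)
        return -1;
    return ready > 0 && FD_ISSET(fd, &readfds);
}
```

The sets are modified in place by select(), so they have to be rebuilt before every call.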
-poll()
-the difference between select() and poll() is that poll() takes a single array
of pollfd structures (instead of 3 separate fd sets for read, write, exception)
-eg: struct pollfd{
int fd; //file descriptor
short events; //events to watch
short revents; //returned events
};
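-a sketch using the structure above: wait for one fd with poll(), then read a byte from it. read_byte_with_timeout is a made-up helper name.

```c
#include <poll.h>
#include <unistd.h>

/* Waits up to `timeout_ms` milliseconds for data on `fd`, then reads
 * one byte into `*out`. Returns 1 if a byte was read, 0 on timeout or
 * EOF, -1 on error. */
int read_byte_with_timeout(int fd, int timeout_ms, char *out)
{
    struct pollfd pfd;
    int ready;

    pfd.fd = fd;
    pfd.events = POLLIN;     /* events to watch: data available to read */
    pfd.revents = 0;         /* returned events, filled in by the kernel */

    ready = poll(&pfd, 1, timeout_ms);
    if (ready == -1)
        return -1;
    if (ready == 0)          /* timeout, nothing became ready */
        return 0;
    /* POLLHUP or POLLERR may be reported instead of POLLIN; read() then
     * returns 0 (EOF) or -1 accordingly. */
    return (int) read(fd, out, 1);
}
```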
-epoll()
-while both select() and poll() do the job of multiplexed IO, they require the
full list of file descriptors to watch to be passed in on every call.
-epoll is different in that it separates registering the fds to monitor from the actual monitoring. One syscall initializes an epoll context, another adds
fds to it or removes them from it, and a third performs the actual event
wait.
-eg: int epoll_create(int size);
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event *events, int maxevents,
int timeout);
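-the three calls in sequence, as a sketch for a single fd. wait_readable_epoll is a made-up helper; a real program would create the epoll context once and reuse it instead of rebuilding it per call.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Waits up to `timeout_ms` milliseconds for `fd` to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error. */
int wait_readable_epoll(int fd, int timeout_ms)
{
    struct epoll_event ev = {0}, out = {0};
    int ready;
    int epfd = epoll_create(1);   /* step 1: create the context (size must be > 0) */

    if (epfd == -1)
        return -1;

    ev.events = EPOLLIN;
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1) {   /* step 2: register fd */
        close(epfd);
        return -1;
    }

    ready = epoll_wait(epfd, &out, 1, timeout_ms);          /* step 3: wait */
    close(epfd);
    if (ready == -1)
        return -1;
    return ready > 0 && out.data.fd == fd && (out.events & EPOLLIN) != 0;
}
```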
-Buffered IO
-all disk operations can be measured in terms of blocks.
-An optimal situation for IO emerges if syscalls request integer multiples of
the block size in the minimum number of iterations. This is because in that case, the kernel doesn't have to waste cycles on partial-block operations.
-If programs can't issue such requests themselves, the second approach is
buffered IO (e.g. the C standard I/O library), in which data is accumulated
in a userspace buffer until it reaches a multiple of the block size, and only
then handed to the kernel.
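-a sketch of buffered IO with stdio: many one-byte writes go into a block-sized buffer, and the library issues a write() only when the buffer fills or the stream is closed. write_bytes_buffered is a made-up name; sizing the buffer from st_blksize is my assumption about a sensible block size.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

/* Writes `count` single bytes to `path` through a fully buffered stdio
 * stream whose buffer matches the file's preferred I/O block size.
 * Returns the final file size, or -1 on error. */
long write_bytes_buffered(const char *path, long count)
{
    struct stat st;
    char *buf;
    long i;
    int ok = 1;
    FILE *f = fopen(path, "w");

    if (f == NULL)
        return -1;
    if (fstat(fileno(f), &st) == -1) {
        fclose(f);
        return -1;
    }
    buf = malloc((size_t) st.st_blksize);
    if (buf == NULL ||
        setvbuf(f, buf, _IOFBF, (size_t) st.st_blksize) != 0) {
        fclose(f);
        free(buf);
        return -1;
    }

    for (i = 0; i < count && ok; i++)
        ok = (fputc('x', f) != EOF);    /* lands in buf, not in the kernel */

    if (fclose(f) == EOF)               /* flushes whatever is still buffered */
        ok = 0;
    free(buf);                          /* only safe after fclose() */
    if (!ok || stat(path, &st) == -1)
        return -1;
    return (long) st.st_size;
}
```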