In computer programming and formal language theory (and in other branches of mathematics), a string is an ordered sequence of symbols chosen from a predetermined set.
In programming, each symbol of a string is represented in memory by a numeric value. Declaring a variable with a string datatype usually causes storage to be allocated that can hold some predetermined number of symbols. A string that appears in source code is known as a string literal and is written in a form that marks it as such. The term binary string is sometimes used for an arbitrary sequence of bits.
Formal theory
Let Σ be an alphabet, a non-empty finite set. Elements of Σ are called symbols or characters. A string (or word) over Σ is any finite sequence of characters from Σ. For example, if Σ = {0, 1}, then 0101 is a string over Σ.
The length of a string is the number of characters in the string (the length of the sequence) and can be any non-negative integer. The empty string is the unique string over Σ of length 0, and is denoted ε or λ.
The set of all strings over Σ of length n is denoted Σⁿ. For example, if Σ = {0, 1}, then Σ² = {00, 01, 10, 11}. Note that Σ⁰ = {ε} for any alphabet Σ. The set of all strings over Σ of any length is the Kleene closure of Σ and is denoted Σ*. In terms of Σⁿ, Σ* = Σ⁰ ∪ Σ¹ ∪ Σ² ∪ …, that is, the union of Σⁿ over all n ≥ 0. For example, if Σ = {0, 1}, Σ* = {ε, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, …}. Although Σ* itself is countably infinite, all elements of Σ* have finite length.
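As a rough illustration of these definitions, the following Python sketch enumerates the strings of a given length and lazily walks through the Kleene closure in order of increasing length. The function names `strings_of_length` and `kleene_star` are hypothetical, chosen only for this example.

```python
from itertools import islice, product

def strings_of_length(alphabet, n):
    """Return every string of length n over the alphabet (hypothetical helper)."""
    return ["".join(symbols) for symbols in product(alphabet, repeat=n)]

def kleene_star(alphabet):
    """Lazily yield all strings over the alphabet, shortest first (the set is infinite)."""
    n = 0
    while True:
        yield from strings_of_length(alphabet, n)
        n += 1

sigma = ["0", "1"]
print(strings_of_length(sigma, 2))          # ['00', '01', '10', '11']
print(strings_of_length(sigma, 0))          # [''] (only the empty string)
print(list(islice(kleene_star(sigma), 7)))  # ['', '0', '1', '00', '01', '10', '11']
```

A generator is used for the closure because the full set can never be materialized; `islice` takes a finite prefix of it.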
A set of strings over Σ (i.e. any subset of Σ*) is called a formal language over Σ. For example, if Σ = {0, 1}, the set of strings with an even number of zeros ({ε, 1, 00, 11, 001, 010, 100, 111, 0000, 0011, 0101, 0110, 1001, 1010, 1100, 1111, …}) is a formal language over Σ.
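A formal language can often be described by a membership test rather than by listing its elements. The sketch below does this for the even-number-of-zeros example; `even_zeros` is a hypothetical name, and the enumeration is cut off at length 3 purely to keep the output short.

```python
from itertools import product

def even_zeros(s):
    """Membership test: is s a string over {0, 1} with an even number of 0s?"""
    return set(s) <= {"0", "1"} and s.count("0") % 2 == 0

# List the members of length at most 3, shortest first.
members = [
    "".join(p)
    for n in range(4)
    for p in product("01", repeat=n)
    if even_zeros("".join(p))
]
print(members)  # ['', '1', '00', '11', '001', '010', '100', '111']
```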
Concatenation and substrings
Concatenation is an important binary operation on Σ*. For any two strings s and t in Σ*, their concatenation is defined as the sequence of characters in s followed by the sequence of characters in t, and is denoted st. For example, if Σ = {a, b, …, z}, s = bear, and t = hug, then st = bearhug and ts = hugbear.
String concatenation is an associative, but non-commutative operation. The empty string serves as the identity element; for any string s, εs = sε = s. Therefore, the set Σ* and the concatenation operation form a monoid, the free monoid generated by Σ. In addition, the length function defines a monoid homomorphism from Σ* to the non-negative integers.
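These algebraic properties can be spot-checked on a small sample of strings. The sketch below uses Python's `+` operator as concatenation and `len` as the length function; checking a finite sample is only an illustration, not a proof.

```python
from itertools import product

# All strings over {a, b} of length at most 2, used as a small test sample.
sample = ["".join(p) for n in range(3) for p in product("ab", repeat=n)]
empty = ""

for s, t, u in product(sample, repeat=3):
    assert (s + t) + u == s + (t + u)     # associativity
    assert len(s + t) == len(s) + len(t)  # length maps concatenation to addition

for s in sample:
    assert empty + s == s + empty == s    # the empty string is the identity

# Concatenation is not commutative in general.
print("bear" + "hug", "hug" + "bear")     # bearhug hugbear
```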
A string s is said to be a substring or factor of t if there exist (possibly empty) strings u and v such that t = usv. The relation "is a substring of" defines a partial order on Σ*, the least element of which is the empty string.
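The definition translates directly into a search over every position at which s could start inside t. The helper `is_substring` below is a hypothetical name for this sketch; in practice Python's built-in `s in t` performs the same test.

```python
def is_substring(s, t):
    """Return True if t = u + s + v for some (possibly empty) strings u and v."""
    return any(t[i:i + len(s)] == s for i in range(len(t) - len(s) + 1))

print(is_substring("ear", "bearhug"))   # True  (u = "b", v = "hug")
print(is_substring("", "bearhug"))      # True  (the empty string is a substring of every string)
print(is_substring("hugs", "bearhug"))  # False
```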
Lexicographical ordering
It is often necessary to define an ordering on the set of strings. If the alphabet Σ has a total order (cf. alphabetical order) one can define a total order on Σ* called lexicographical order. Since Σ is finite, it is always possible to define a well ordering on Σ, and thus on Σ* (for example, by comparing lengths first and breaking ties lexicographically). The lexicographical order itself is total, but it is not a well ordering when Σ has more than one symbol: with 0 < 1, the strings 1 > 01 > 001 > … form an infinite descending chain. For example, if Σ = {0, 1} and 0 < 1, then the lexicographical ordering of Σ* is ε < 0 < 00 < 000 < … < 011 < 0110 < … < 01111 < … < 1 < 10 < 100 < … < 101 < … < 111 …
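Python compares strings lexicographically by code point, so for the alphabet {0, 1} with 0 < 1 the built-in `sorted` reproduces the ordering described above. For contrast, the sketch also sorts by length first and then lexicographically; that second ordering is an addition for illustration and is not part of the text above.

```python
words = ["1", "011", "0", "", "10", "00", "0110"]

# Lexicographical (dictionary) order: a prefix sorts before its extensions.
print(sorted(words))
# ['', '0', '00', '011', '0110', '1', '10']

# Length-first order (assumed here for comparison): shorter strings come first.
print(sorted(words, key=lambda w: (len(w), w)))
# ['', '0', '1', '00', '10', '011', '0110']
```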
String operations
A number of additional operations on strings commonly occur in the formal theory. These are given in the article on string operations.
String datatypes
A string datatype is a datatype modeled on the idea of a formal string. Strings are such an important and useful datatype that they are implemented in nearly every programming language. In some languages they are available as primitive types and in others as composite types. The syntax of most high-level programming languages allows for a string, usually quoted in some way, to represent an instance of a string datatype; such a meta-string is called a literal or string literal.
String length
Although formal strings can have an arbitrary (but finite) length, the length of strings in real languages is often constrained to an artificial maximum. In general, there are two types of string datatypes: fixed length strings, which have a fixed maximum length and use the same amount of memory whether or not this maximum is reached, and variable length strings, whose length is not arbitrarily fixed and which use varying amounts of memory depending on their actual size. Most strings in modern programming languages are variable length strings. Despite the name, even variable length strings are limited in length, although the limit generally depends only on the amount of memory available.
Character encoding
Historically, string datatypes allocated one byte per character. Although the exact character set varied by region, character encodings were similar enough that programmers could generally get away with ignoring the differences: the character sets used by the same system in different regions either had a given character in the same place or did not have it at all. These character sets were typically based on ASCII or EBCDIC.
Logographic languages such as Chinese, Japanese, and Korean (known collectively as CJK) need far more than 256 characters (the limit of a one-byte-per-character encoding) for reasonable representation. The usual solutions kept single-byte representations for ASCII and used two-byte representations for CJK ideographs. Using these encodings with existing code led to problems with matching and cutting of strings, the severity of which depended on how the character encoding was designed. Some encodings, such as the EUC family, guarantee that a byte value in the ASCII range will only represent that ASCII character, making the encoding safe for systems that use those characters as field separators. Other encodings, such as ISO-2022 and Shift-JIS, make no such guarantee, so matching on byte codes is unsafe. A second issue is that if the beginning of a string is deleted, important instructions for the decoder or information on the position within a multibyte sequence may be lost. A third is that if strings are joined together (especially after having their ends truncated by code not aware of the encoding), the first string may not leave the encoder in a state suitable for dealing with the second string.
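The difference between the two guarantees can be seen on a single character. The sketch below encodes the katakana letter SO with Python's standard `shift_jis` and `euc_jp` codecs; the choice of this particular character is an assumption made for the example, since its Shift-JIS form happens to end in the byte that ASCII uses for the backslash.

```python
text = "\u30bd"  # KATAKANA LETTER SO

sjis = text.encode("shift_jis")
euc = text.encode("euc_jp")

# In Shift-JIS the trailing byte is 0x5C, the ASCII code for a backslash,
# so a byte-level search for the backslash gives a false match.
print(sjis.hex(), b"\\" in sjis)

# In EUC-JP every byte of a multibyte character is 0x80 or above,
# so ASCII byte values never appear inside it.
print(euc.hex(), all(byte >= 0x80 for byte in euc))
```

Code that splits or escapes on ASCII bytes is therefore safe on the EUC-JP form but may cut the Shift-JIS form in the middle of a character.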
Unicode has complicated the picture somewhat. Most languages have a datatype for Unicode strings, usually based on UTF-16 because it was typically added before the Unicode supplementary planes were introduced. Converting between Unicode and local encodings requires knowledge of the local encoding, which can be a problem for existing systems in which strings in various encodings are transmitted together with no reliable indication of which encoding each one uses.
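A small sketch of what goes wrong when the encoding of transmitted bytes is not marked: the same bytes decode to different text depending on which encoding the receiver assumes. The sample word and the two encodings are chosen only for illustration.

```python
original = "caf\u00e9"                 # "café"
data = original.encode("utf-8")        # the bytes as transmitted, with no encoding label

wrong = data.decode("latin-1")         # receiver guesses the wrong encoding
right = data.decode("utf-8")           # receiver knows the actual encoding

print(wrong)   # cafÃ©
print(right)   # café
```

Decoding as Latin-1 never raises an error, because every byte value is valid in that encoding, which is why such mistakes tend to pass silently.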
Implementations
Some languages, such as C++, implement strings as templates that can be used with any primitive type, but this is the exception, not the rule.
If an object-oriented language represents strings as objects, they are called mutable if the value can change at runtime and immutable if the value is frozen after creation. For example, Ruby has mutable strings, while Python's strings are immutable.
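The immutability of Python strings can be observed directly: in-place modification is rejected, and methods that appear to change a string return a new one. The `bytearray` at the end is shown only as a contrast, being a mutable sequence of bytes rather than a string type.

```python
s = "frank"
try:
    s[0] = "F"                  # str does not support item assignment
except TypeError as exc:
    print("immutable:", exc)

t = s.capitalize()              # builds a new string; s is unchanged
print(s, t)                     # frank Frank

buf = bytearray(b"frank")       # a mutable sequence of bytes
buf[0] = ord("F")
print(buf.decode("ascii"))      # Frank
```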
Representations
Representations of strings depend heavily on the choice of character repertoire and the method of character encoding. Older string implementations were designed to work with the repertoire and encoding defined by ASCII, or more recent extensions like the ISO 8859 series. Modern implementations often use the extensive repertoire defined by Unicode along with a variety of complex encodings such as UTF-8 and UTF-16.
Most string implementations are very similar to variable-length arrays with the entries storing the character codes of corresponding characters. The principal difference is that, with certain encodings, a single logical character may take up more than one entry in the array. This happens for example with UTF-8, where single characters can take anywhere from one to four bytes. In these cases, the logical length of the string differs from the logical length of the array.
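The gap between the two lengths is easy to observe in Python, where `len` on a `str` counts code points and `len` on the encoded `bytes` counts array entries. The sample characters are arbitrary picks that need one, two, three, and four bytes in UTF-8.

```python
# LATIN CAPITAL A, LATIN SMALL E WITH ACUTE, EURO SIGN, GRINNING FACE
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    encoded = ch.encode("utf-8")
    print(repr(ch), len(ch), len(encoded))   # one character, 1 to 4 bytes

word = "na\u00efve"                          # "naïve"
print(len(word), len(word.encode("utf-8")))  # 5 6
```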
The length of a string can be stored implicitly by using a special terminating character; often this is the null character, which has the value zero, a convention used and perpetuated by the popular C programming language. Hence, this representation is commonly referred to as a C string. The length of a string can also be stored explicitly, for example by prefixing the string with its length as a byte value, a convention used in Pascal; consequently some people call it a P-string.
In terminated strings, the terminating code is not an allowable character in any string. Here is an example of a null-terminated string stored in a 10-byte buffer, along with its ASCII representation:
| F  | R  | A  | N  | K  | NUL | k  | e  | f  | w  |
| 46 | 52 | 41 | 4E | 4B | 00  | 6B | 65 | 66 | 77 |
The length of the string in the above example is 5 characters, but it occupies 6 bytes. Characters after the terminator do not form part of the representation; they may be either part of another string or just garbage. (Strings of this form are sometimes called ASCIZ strings, after the original assembly language directive used to declare them.)
Here is the equivalent (old style) Pascal string stored in a 10-byte buffer, along with its ASCII representation:
| length | F  | R  | A  | N  | K  | k  | f  | f  | w  |
| 05     | 46 | 52 | 41 | 4E | 4B | 6B | 66 | 66 | 77 |
While these representations are common, others are possible. Using ropes makes certain string operations, such as insertions, deletions, and concatenations, more efficient.
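The two buffer layouts shown in the tables above can be read back with a few lines of Python operating on raw bytes. The functions `read_c_string` and `read_pascal_string` are hypothetical helpers written for this sketch, not library routines.

```python
def read_c_string(buf):
    """Return the bytes up to, but not including, the first NUL terminator."""
    end = buf.index(0)          # raises ValueError if no terminator is present
    return bytes(buf[:end])

def read_pascal_string(buf):
    """Return the bytes of a string whose first byte holds its length (at most 255)."""
    length = buf[0]
    return bytes(buf[1:1 + length])

# The two 10-byte buffers from the tables, written as hexadecimal.
c_buf = bytes.fromhex("46 52 41 4E 4B 00 6B 65 66 77")
p_buf = bytes.fromhex("05 46 52 41 4E 4B 6B 66 66 77")

print(read_c_string(c_buf))       # b'FRANK'
print(read_pascal_string(p_buf))  # b'FRANK'
```

Finding the length of the terminated form requires scanning for the terminator, while the length-prefixed form gives it in one read but caps the length at what the prefix can hold.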
Vectors
While character strings are the most common kind of string, a string in computer science may refer generically to any vector of homogeneously typed data. A string of bits or bytes, for example, may be used to represent data retrieved from a communications medium. This data may or may not be represented by a string-specific datatype, depending on the needs of the application, the desire of the programmer, and the capabilities of the programming language being used.
String processing algorithms
There are many algorithms for processing strings, each with various trade-offs. Some categories of algorithms include string searching, string manipulation, sorting, regular expression matching, and parsing.
Advanced string algorithms often employ complex mechanisms and data structures, among them suffix trees and finite state machines.
Character string oriented languages and utilities
Character strings are such a useful datatype that several languages have been designed to make string processing applications easy to write. Examples include AWK, Icon, Perl, REXX, Ruby, SNOBOL, and Tcl.
Many UNIX utilities perform simple string manipulations and can be used to easily program some powerful string processing algorithms. Files and finite streams may be viewed as strings.
Several string libraries exist for the C and C++ programming languages that add greater functionality for string processing in those languages.
Some APIs, such as the Media Control Interface, embedded SQL, or printf, use strings to hold commands that will be interpreted.
Recent scripting programming languages, including Perl, Python, Ruby, and Tcl, employ regular expressions to facilitate text operations.
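As a brief illustration of the kind of text operation meant here, the following sketch uses Python's standard `re` module to find and rewrite dates in a line of text. The sample text and the patterns are made up for the example.

```python
import re

log = "2024-01-05 error disk full; 2024-02-11 warning fan slow"

# Find every date written as YYYY-MM-DD.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", log)
print(dates)     # ['2024-01-05', '2024-02-11']

# Rewrite each date as DD/MM/YYYY using numbered groups.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", log)
print(swapped)   # 05/01/2024 error disk full; 11/02/2024 warning fan slow
```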
Character string functions
String functions are used to manipulate a string or to change or edit its contents. They are also used to query information about a string. They are usually used within the context of a computer programming language.
The most basic example of a string function is the length(string) function. It returns the length of a string (not counting the null terminator or any other internal structural information) and does not change the string; for example, length("hello world") returns 11. Many string functions exist in several languages with similar or identical syntax and parameters; in many languages, for example, the length function is written len(string). Useful as string functions are, programmers should keep in mind that a function in one language may behave differently in another, or may have a different name, parameters, syntax, or result.
See also