Reading Note of CppPrimer-Chapter3

发表于 2022-05-27 更新于 2024-08-06 分类于 C++ 阅读次数：本文字数： 16k 阅读时长 ≈ 14 分钟

String & Vector & Array

String & String Literal

Typical memory layout of std::string
Most modern std::string implementations save very small strings directly on the stack in a statically sized char array instead of using dynamic heap storage. This is known as Small (or Short) String Optimisation (SSO). It allows implementations to avoid heap allocations for small string objects and improves locality of reference.

Furthermore, there will be a std::size_t member to save the strings size and a pointer to the actual char storage.

How this is specifically implemented differs but something along the following lines works:
1
2
3
4
5
6
7
8
9
template <typename T>
struct basic_string {
char* begin_;
size_t size_;
union {
size_t capacity_;
char sso_buffer[16];
};
};

std::string::size 类型是 std::string::size_type, 实为 unsigned，所以不要将 size_t 与 signed int 混用；

std::string str;
// when unsigned do algorithmic operation with signed type, signed will convert to an unsigned one
// equal to str.size() < 4294967295
str.size() < -1;    //会永远返回true，

terminal string 输入

一次输入一行用 getline(in_stream, string)，遇到换行返回，string 不包含换行
一次输入一个字符串，空格分开，可以使用用 std::cin >> str，string 不包含空格 ;
终止条件都是遇到 EOF(windows ctrl+z, linux ctrl+d)

// case 1: getline
std::string line;
// loop until encounter EOF
while(getline(std::cin, line)){
    // read one line each time
}

// case 2: >> operator
std::string str;
// input: ab 123 sdas
// get 3 strings, separated by blankspace
while (std::cin >> str){
    // read one string each time
}

string 比较：字典序 (lexicographical order)
1. 两个 string 长度不同，但是对应位置上的字符相同，则较短的 string < 较长的 string
2. 两个 string 某些位置上的字符不同，此时 string 的比较结果为第一对相异字符比较的结果

string 的内部类型为 char，可以理解为类似于 std::vector<char>

std::string s = "string";
// type of c is char
for (auto& c : s){        
    std::cout << c << std::endl;
}

advantages of std::string over std::vector<char>
- C++11 introduced the requirement that a std::string is required to store a NULL-terminated sequence of characters internally. stackoverflow. That brings it into compliance and makes interoperating with C-style strings easier. Obviously, std::vector<char> would not have that requirement associated with it, and you wouldn’t want it to.
- Common implementations of std::string will use an optimization called the “small string optimization (SSO)”, which avoids dynamic memory allocation when you are storing a string that will fit directly within the std::string object instance. You won’t find this optimization in std::vector<> (although it could actually be implemented in a custom vector type).
- std::string offers a very different and much expanded interface compared to std::vector<>. While the latter is just a boring old sequence of elements, the former is actually designed to represent a string and therefore offers an assortment of string-related convenience functions.

useful char utils

#include <cctype>
isalnum();
isalpha();
isdigit();
isspace();        // blankspace, \t,\n,\r 
isupper();

string literal 和 string 是两种类型

For historical reasons, and for compatibility with C, string literals are not standard library strings

String literal ： "s-char-sequence(optional)"，ordinary string literal.

The type of an unprefixed string literal is const char[N], where N is the size of the string in code units of the execution narrow encoding (until C++23)ordinary literal encoding (since C++23), including the null terminator.

Concatenation: String literals placed side-by-side are concatenated at translation phase 6 (after the preprocessor). That is, "Hello," " world!" yields the (single) string "Hello, world!".

The null character (‘\0‘, L’\0‘, char16_t(), etc) is always appended to the string literal: thus, a string literal "Hello" is a const char[6] holding the characters 'H', 'e', 'l', 'l', 'o', and '\0'.

String literals have static storage duration, and thus exist in memory for the life of the program.

String literals can be used to initialize character arrays. If an array is initialized like char str[] = "foo", str will contain a copy of the string "foo".

std::string operator + 两侧的数据必须至少有一个是 std::string，比如：

std::string name = "Pang" + "Wong";  // error: invalid operands to binary expression 
                                     //        ('const char [5]' and 'const char [5]')
std::string name = "pang" "wong";    // valid, concatenate 2 litertal strings by put them side by side

// valid
std::string last_name = "Wong";
std::string full_name = "Pang" + last_name;

C 风格字符串 (C-Style String, a kind of char array terminate with '\0')

C 风格字符串不是一种类型，而是一种约定俗成的写法，即将字符串存放在字符数组中，并以空字符结束 (null terminated，'\0')

字符数组 char [] 有另外一种初始化方式，即 init from literal string。使用这种初始化的时候，literal string 会自动补上'\0', 因为 string literal 自己本身就会自动补上空字符；

// init from init-list
char a[4] = {'a', 'b', 'c', '\0'};

// init from literal string
char b[] = “PangWong”;     // 此时数组b的size为9，因为字符串结束的地方有空白字符也拷贝进了数组；
char c[3] = "123";         // invalid, size can not fit

任何出现 string literal 的地方，都可以使用 C 风格字符串代替

允许使用 C 风格字符串来初始化 string 对象以及对 string 对象赋值
在 string 对象的加法运算中允许使用 C 风格字符串作为其中的一个运算对象；

char c_style_str[] = "1234";
std::string s1(c_style_str);          // init std::string from c-style string
std::string s2 = c_style_str;         // assign c-style string to std::string
std::string s3 = s1 + c_style_str;    // c-style string is operand of std::string operator+

str_len vs sizeof

sizeof 是一个 C 语言中的一个单目运算符，而 strlen 是一个函数
sizeof 求的是数据类型所占空间的大小，而 strlen 是求字符串的长度
C 风格字符串必须以空字符'\0' 结束，否则 strlen 等操作均会出现意想不到的问题

比如遍历字符串的时候，会到从当前内存开始，一直到出现第一个空字符为止；结果是未定义的；
sizeof 属于不求值操作符 (unevaluated expressions)，只是编译期的常量 (only query the compile-time properties of their operands)，并不会对参数进行编译甚至求值

// init from string literal
char *c="abcdef";
char d[]="abcdef";
// init from init-list
char e[]={'a','b','c','d','e','f'};

// c为一个字符指针变量，指向常量字符串. sizeof(c)求的是类型空间大小，在前面说过，指针型所点的空间大小是4个字节(32位)
// c以字符串赋值时，"abcdef",结尾自动加一个"/0". strlen(c)遇到/0就会结束，strlen求的是字符串的长度，为6.
printf("%d%d/n",sizeof(c),strlen(c));      // 4(8) 6

// strlen(d)也是一样，字符串赋值，自动添加/0,求字符串的长度当然是6.
// sizeof(d)是求这个数组所占空间的大小，即数组所占内存空间的字节数，应该为7.
printf("%d%d/n",sizeof(d),strlen(d));      // 7 6

// sizeof(e), 数组e以单个元素赋值，没有/0结束符，所以所占空间的大小为6个字节。
// strlen(e),去找/0结尾的字符串的长度，由于找不到/0，所以返回的值是一个不确定的值。
printf("%d%d/n",sizeof(e),strlen(e));      // 6 (6)(undefined, stop when encounter \0)

char f[] = "he\0llo\0";
sizeof(f);    // 7
strlen(f);    // 2, terminate at the first '\0'

printf("char=%d/n",sizeof(char));          //1
printf("char*=%d/n",sizeof(char*));        //4
printf("int=%d/n",sizeof(int));            //4
printf("int*=%d/n",sizeof(int*));          //4
printf("long=%d/n",sizeof(long));          //4
printf("long*=%d/n",sizeof(long*));        //4
printf("double=%d/n",sizeof(double));      //8
printf("double*=%d/n",sizeof(double*));    //4
printf("string=%d/n",sizeof(std::string));          //24
printf("vector=%d/n",sizeof(std::vector<int>));     //24
printf("vector=%d/n",sizeof(std::vector<bool>));    //24
printf("vector=%d/n",sizeof(std::vector<long>));    //24
printf("sizeof(std::map<int, int>%d/n", sizeof(std::map<int, int>);              // 24
printf("sizeof(std::unordered_map<int, int>%d/n", std::unordered_map<int, int>); // 40

不可直接比较两个 C 风格的字符串，他们本质上是两个不相关的 const char * 指针，会得出 UB 的结果；

cstring 头文件中提供了 C 语言风格字符串的函数

#include <cstring>
char p1[] = "PangWong";
char p2[] = "123";    
strlen(p1);            // length of p1, terminate with '\0', not included
strcmp(p1, p2);        // 0: equal, positive: bigger, negative: smaller
strcat(p1, p2);        // cat p2 at the tail of p1, return p1
strcpy(p1, p2);        // copy from p2 to p1, return p1

Vector & Iterator

std::vector 是支持比较的，比较的规则和 std::string 类型（字典序），前提是 vector 内的类型是可比较的
1
2
std::vector<int> vec(2, 0), vec1(2, 1);
vec < vec1; // true, {0,0} < {1,1}

std::vector<bool> 是 std::vector 的特例化，不是简单的模板实例化

主要区别在于 std::vector::iterator 的解引用返回的是一个 proxy reference 的 temporary object

Exposes class std::vector::reference as a method of accessing individual bits. In particular, objects of this class are returned by operator[] by value.
1
2
3
4
5
6
7
8
std::vector<bool> v{false, false, false};
for (auto x : v) {            // x is reference
    x = true; // Changes the element inside v!
}
std::vector<int> v1{1,2,3};
for (auto x: v1){
    x = 10;   // NOT Change the element in v1
}

确保下标合法的一种有效手段是尽可能使用 range for 语句，下标运算符[] 只可用于访问已存在的数据，不能添加数据
1
2
3
for (auto v : vec){
// print v
}
如果容器为空，则 std::begin () 和 std::end 都返回的是 container.end ()，即尾后迭代器

可以对迭代器解引用，类似于指针解引用；但 container.end()，因为不指向任何具体元素，不可以递增和解引用
1
2
3
4
std::vector<int> vec;
assert(vec.begin() == vec.end()); // true
vec.push_back(1);
assert(*vec.begin() == 1); // true

使用迭代器的时候判断边界条件，习惯使用 !=，因为对于所有的标准容器库的迭代器类型，肯定都定义了 != 操作符，但大多没有定义 < 操作符；

1
2
3

for (auto it=vec.begin(); it != vec.end(); it++){
    // do something
}

但是某些情况使用 != 会导致访问越界，比如在迭代的过程中改变了容器的大小或者对迭代器做了较大的增减；

// segment fault， because a.end()+1 != a.end() but a.end() + 1 is invalid iterator
std::vector<int> a = {1,2,3};
for (auto it=a.begin(); it!=a.end(); it++){
    std::cout << "------------------" << std::endl;
    std::cout << "before erase" << std::endl;
    std::cout << "it->" << *it << std::endl;
    std::cout << a.end() - it << std::endl;

    // erase
    it = a.erase(it);    // now iter=iter+1, iter++ make iter=iter+2;
  
    std::cout << "after erase" << std::endl;
    std::cout << "it->" << *it << std::endl;
    std::cout << a.end() - it << std::endl;
}
//[1]    67730 segmentation fault  ./test

凡是使用了迭代器的循环，在循环中轻易都不要做增加或删除迭代器所属容器大小的操作；迭代器可能会失效；

// valid
std::vector<int>a = {1,2,3,4,5,6,7};
for (auto it=a.begin(); it!=a.end(); it++){
    std::cout << "it->" << *it << std::endl;
}
//it->1
//it->2
//it->3

// segment fault or normal terminate according to a.size()
for (auto it=a.begin(); it!=a.end(); it++){
    std::cout << "it->" << *it << std::endl;
    // delete item
    a.erase(it);
}
//it->1
//it->3
//it->3


// invalid, dead loop
for (auto it=a.begin(); it!=a.end(); it++){
    std::cout << "it->" << *it << std::endl;
    // append item
    a.push_back(10);
}
//it->1
//it->46241    # undefined
//it->2043    # undefined
//it->0
//it->0
//it->0
//...

std::vector 的迭代器失效 (Iterator invalidation) 问题

Operations	Invalidated
All read only operations	Never
swap, std::swap	end() Every iterator referring to an element in one container before the swap shall refer to the same element in the other container after the swap. The `end()` iterator does not refer to any element, so it may be invalidated. ref: stackoverflow
clear, operator=, assign	All iterators, pointers and references to the elements of the container are invalidated. The past-the-end iterator is also invalidated.
reserve, shrink_to_fit	If the vector changed capacity, all of them. If not, none.
erase	Erased elements and all elements after them (including end())
push_back, emplace_back	If the vector changed capacity, all of them. If not, only end().
insert, emplace	If the vector changed capacity, all of them. If not, only those at or after the insertion point (including end()).
resize	If the vector changed capacity, all of them. If not, only end() and any elements erased.
pop_back	The element erased and end().

容器的迭代器的差值为 difference_type，可正可负，两个迭代器之间的操作只有'-'，没有'+'；

std::vector<int> v = {1,2,3,4};
std::vector<int>::difference_type diff = v.end()-v.begin();        // 4

// invalid, error: invalid operands to binary expression  ('std::vector<int>::iterator' (aka 	
//                 '__bit_iterator<std::vector<int>, false>') and 'std::vector<int>::iterator' 
auto sum = v.end() + v.begin();

Array & Multiple-Dimension Array

数组的 size 一般是固定的，所以编译的时候数组维度应该是 constexpr 表达式；

使用 initializal-list 初始化，数组的维度可以在初始化时推断 (deduce) 出来；

constexpr int size_c = 10;
int array[size_c];        // valid

const int size_nc = 10;
int array[size_nc];       // invalid

int array[] = {1,2,3,4,5};      // size of array can be deducted as 5

复合类型是指基于其他类型定义的类型

引用、指针、数组、容器

数组的元素应该为对象，不存在引用的数组，但是存在数组的引用

int arr[] = {1,2,3,4,5,6,7,8,9,10};
int& array[10];                 // error: 'array' declared as array of references of type 'int &'
int (&array)[10] = arr;         // valid, array is a reference to arr

typedef int array_type[10];
array_type& array = arr;        // valid, array is a reference to arr

数组的初始化的一些方法：

可以使用初始化列表，此时可以不指定数组维度，编译器可以自己推断；
指定数组维度的时候，如果初始值小于维度，则用初始值初始化前面的元素，剩余维度执行零认初始化
不可以使用一个数组给另一个数组赋值，也不允许直接拷贝拷贝一个数组作为另一个数组的初始值
初始化多维数组的时候既可以使用嵌套的 initializal-list，也可以使用 1-row 的 initialzal-list 来直接初始化，因为本质上多维数组也是线性存储的；
类似于一位数组，指定数组维度的时候，每个数组都可以只初始化一部分，其他部分执行零初始化；

int array[100];             // 默认初始化，未定义
int array[100] = {};        // 值初始化，都初始化为0
int array[100] = {0};       // 将所有的值都设为0(第一个手动设为0，其他的默认设为0)；
int array[100] = {-1};      // 只将数组第一个值设为-1，其他设为0
int array[100] = {-1,-2,-3}
std::fill_n(array, 100, DEFAULT_VALUE);    // #include <algorithm>

int a[] = {0,1,2};          // valid, array size can be deducted
int a1[] = a;               // invalid, array can not be assigned to another array directly


int a[2][2] = {             // 多维数组列表初始化
        {1,2},
        {3,4}
};
int b[2][2] = {1,2,3,4};    // 等价于上面的
int c[2][2] = {             // 初始化每行的首元素
      {7},
      {8}
}
int d[2][2] = {1,2};        // 初始化第一行

在大多数表达式里面，使用数组类型的对象，实际是使用了一个指向该数组首元素的指针；

auto 会自动把数组转成指针，但是 decltype 不会，结果是带上数组维度和类型的类型，数组的引用也会带上数组维度和类型；

int a[2] = {0,1};
auto b(a);          // b: int*
// equal to
auto b = &(a[0])   
                    // decltype(a): int[2]
decltype(a) c = a;  // invalid: array initializer must be an initializer list
decltype(a) &d = a; // valid for a is reference to array

数组的别名

1 2	using int_array4 = int[4]; typedef int int_array4[4]; // equal to using

理解数组的复合类型

理解一般 const or pointer 的复合类型的时候可以从右往左，理解数组的复合类型时，可以考虑由数组名从内到外

// 复合类型
int i = 42;
int *p = &i;       // p是指向int的指针
int *&r = p;       // r是对指向int的指针的引用，从右往左看，首先r是一个引用，再往左的部分就是引用的类型，可以看出是int*

// 数组
int array[10];     // array是包含10个整数的数组
int *array1[10];   // array1是包含10个整数指针的数组
//int &array2[10]; // invalid, array2是包含10个整数引用的数组，不存在引用的数组
int (*array3)[10] = &array    // array3是一个指向含有10个整数数组的指针
int (&array4)[10] = array;    // array4是对一个含有10和整数数组的引用

// 更复杂的数组
// 由内到外，首先array5是一个引用，观察右边知道引用对象是一个大小为10的数组，观察左边可以知道数组的元素类型是int*，于是
// array5是一个对大小为10的int*数组的引用
int *(&array5)[10] = array1;

std::array 和 array 性能上相似，只是 std::array 在操作上更像标准容器；
1
2
int arr1[10];
std::array<int, 10> arr2;

和迭代器一样，指针也可以算术相减，结果是他们之间的距离。

空指针也可以加上或者剪掉值为 0 的整形常量表达式；

两个空指针也可以相减，结果为 0；

对于数组指针，只有两个指针都指向同一对象内的地址的时候，算术运算才有意义；

#include <cstddef>

int array[] = {1,2,3,4,5};
// array size
ptrdiff_t size = std::end(array) - std::begin(array);

int* p1 = nullptr;
int* p2 = nullptr;
std::cout << "null sub: " << p1-p2 << std::endl;        // null sub: 0
std::cout << "p1+1: " << p1+1 << std::endl;             // p1+1: 0x4
int sz = 1;
std::cout << "p1-sz: " << p1-sz << std::endl;           // p1+sz: 0xfffffffffffffffc

一个指针变量加 / 减一个整数是将该指针变量的原值(是一个地址) 和它指向的变量所占用的内存单元字节数相加或相减。如 p+i 代表这样的地址计算：p+i*d，d 为 p 所指向的变量单元所占用的字节数；

constexpr MAX=3;
int  var[MAX] = {10, 100, 200};
for ( i = 0; i < MAX; i++){
      printf("address ：var[%d] = %x\n", i, ptr );
      printf("value：var[%d] = %d\n", i, *ptr );
      /* 移动到下一个位置 */
      ptr++;
}
//address：var[0] = bf882b30
//value：var[0] = 10
//address：var[1] = bf882b34
//value： var[1] = 100
//address：var[2] = bf882b38
//value：var[2] = 200

多维数组是数组的数组，所以数组名转换来的指针实际上是指向第一个数组的指针；外层数组指针的加 1 其实是将指针向后移动一整个内层数组的大小

int ia[3][4];
for (int_array *p=ia; p!=ia+3; ++p){       // p is pointer to array, p++ point to next inner array
      for (int* q=*p; q<*p+4; q++){
          std::cout << *q << std::endl;
    }
}

数组遍历用 std::begin/std::end 或者 range based for loop 会更安全方便

数组 index 越界并不会出现编译错误和 runtime 抛异常，事实上越界后的内存只要不是操作系统保护的一般都是可以访问的。但是越界后的内存以后可能给其他人用，可能会导致意想不到的结果。正是因为越界不抛异常，所以可以取数组越界后的第一个地址（类似于 vector 的尾后指针）来判断循环结束；不过用 std::begin 和 std::end 会更安全方便；

// std::begin/std::end
int a[3] = {1,2,3};
for (auto it=std::begin(a); it<std::end(a); it++){
    std::cout << *it << std::endl;
}
// range based for loop
for (auto v: a){
    std::cout << v << std::endl;
}

多维数组也可以使用 range for, 需要注意的是，为了避免数组自动转换成指针，需要对 auto 加 & 符号，无论本身是不是需要修改数组的值；即多维数组的多层循环，所有对数组的 auto 都要加上引用符号 &

for(auto& row: array){
    for (auto v: row){}
        // do something
    }
}

标准库的 []operator 要求下标为 unsigned int，而 bulit-in[]operator 的是 signed int，可以是负值，表示基于此向前索引 n 个值；
1
2
3
int ia[] = {1,2,3,4,5,6};
int* p = &ia[4]; // a[4] = 5
std::cout << p[-2] << std::endl; // a[2] = 3;