Does C++20 allow "reinterpret_cast"ing between standard C++ Unicode character types and their underlying types?

Tags: c++, undefined-behavior, strict-aliasing, c++20

Do C++20's strict aliasing rules allow doing the following arbitrarily?

  • reinterpreting between char* and char8_t*
  • string str = "string";
    u8string u8str { (char8_t*) &*str.data() }; // c++20 u8string

    u8string u8str2 = u8"zß水

    The char*_t family of types (char8_t, char16_t, char32_t) does not have any special aliasing rules. Therefore, the standard rules apply, and those rules make no exception for conversions between a type and its underlying type.

    So most of what you did is UB. The one case that isn't UB is char, due to its special nature. You can in fact read the bytes of a char8_t as an array of char. But you can't do the opposite, reading the bytes of a char array as char8_t.

    Now, these types are completely convertible to each other. So you can convert the values in those arrays to the other type anytime you want.

    All that being said, on real implementations those things will almost certainly work. Well, until they don't: you modify an object through a pointer that is not allowed to alias it, and the compiler doesn't reload the changed value because it assumed the value couldn't have changed. So really, just use the correct, meaningful type.

    C-style cast is not the same thing as reinterpret_cast.

    The standard sections I think are relevant to your question:

    6.7.1.9: Type char8_t denotes a distinct type whose underlying type is unsigned char. Types char16_t and char32_t denote distinct types whose underlying types are uint_least16_t and uint_least32_t, respectively, in <cstdint>.

    7.2.1.11: If a program attempts to access the stored value of an object through a glvalue whose type is not similar ([conv.qual]) to one of the following types the behavior is undefined:

    1. the dynamic type of the object,

    2. a type that is the signed or unsigned type corresponding to the dynamic type of the object, or

    3. a char, unsigned char, or std::byte type.

    1. char8_t* --> char*: Yes.
      Because char is one of the types through which any object may be accessed. But the standard does not guarantee that the (dereferenced) converted values are equal for distinct types: char can be signed or unsigned, while char8_t is unsigned.
      char8_t* --> unsigned char* is valid, but should not guarantee equal values either, because the types are still distinct. But given that unsigned char is char8_t's underlying type, the values should be equal, I guess?
    2. char* --> char8_t*: No.
      As per 6.7.1.9, those types are distinct. An argument might be made that the "whose underlying type is unsigned char" part could apply, with unsigned char being explicitly allowed in 7.2.1.11.3, but I don't think that would be the correct interpretation; being distinct should be the deciding factor. That is supported by the following comment in the proposal P0482R6 - char8_t: A type for UTF-8 characters and strings (Revision 6 - 2018-11-09) (I did not find a more recent revision):

      Finally, processing of UTF-8 strings is currently subject to an optimization pessimization due to glvalue expressions of type char potentially aliasing objects of other types. Use of a distinct type that does not share this aliasing behavior may allow for further compiler optimizations.

    3. uint32_t* <--> char32_t*, uint16_t* <--> char16_t*, uint16_t* <--> uint_least16_t*, uint32_t* <--> uint_least32_t*, uint_least32_t <--> char32_t, uint_least16_t <--> char16_t: No.
      Those pairs are all distinct, so 7.2.1.11.1 does not apply, and neither type is in 7.2.1.11.3, so not even the second part of 2. can be relevant.

    4. unsigned char* --> char8_t*: No.
      By the same argument as in 2. It is not a T* -> T* cast, which would obviously be allowed.

    5. char8_t* --> unsigned char*: Yes.
      Because unsigned char is also one of the allowed types per 7.2.1.11.3. But I would still argue that the standard does not guarantee that the (dereferenced) converted values will be equal. But given that unsigned char is char8_t's underlying type, it has no option other than to be equal, I guess?
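For the "No" cases in the list above, a common way to stay within the rules is to convert the code units by value rather than reinterpret the buffer. The following is a minimal sketch under that assumption; the helper name to_u16string and the use of std::vector as the source container are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: builds a u16string from code units stored as
// uint_least16_t. Each element is converted to char16_t individually, so
// no uint_least16_t object is ever accessed through a char16_t glvalue.
std::u16string to_u16string(const std::vector<std::uint_least16_t>& units) {
    std::u16string out(units.begin(), units.end());
    assert(out.size() == units.size());
    return out;
}
```

The same pattern applies to uint_least32_t and char32_t with std::u32string.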

    Just so we are on the same page, C-style casts of the form (T*) expression are equivalent to reinterpret_cast<T*>(expression) ([expr.cast]/4.4), which is equivalent to static_cast<T*>(static_cast<void*>(expression)) ([expr.reinterpret.cast]/7). This does nothing to the value of the pointer, as the objects are not pointer-interconvertible. (See [expr.static.cast]/13 and [basic.compound]/4.)

    So yes, we would have to look at [basic.lval]/11 to see if it can be aliased. The reference must have a type which is similar to:

    • the dynamic type of the object,
    • a type that is the signed or unsigned type corresponding to the dynamic type of the object, or
    • a char, unsigned char, or std::byte type.

    Which is not the case. Even though char8_t has the underlying type of unsigned char, it is not a similar type.

    So, for example:

    unsigned char uc = 'a';
    
    // Represents address of uc
    unsigned char* uc_ptr = &uc;
    
    // Still holds the address of uc, not a char8_t
    char8_t* c8_ptr = reinterpret_cast<char8_t*>(uc_ptr);
    
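    char8_t c8 = *c8_ptr;  // UB, as `char8_t` is not `cv unsigned char`.

    Though, because the standard says:

    A fundamental type specified to have a signed or unsigned integer type as its underlying type has the same object representation […]

    you can reinterpret_cast a pointer-to-char8_t and have all the values be equal, but this is the only case (a …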