Why do all my decoded strings have '?' at the end? Java string decoding

I am using the Tweepy library (Python) and Kafka to retrieve tweets from Twitter. The text is encoded in UTF-8, as shown in this line:

self.producer.send('my-topic', data.encode('UTF-8'))
where "data" is a string. This data is then stored in an Oracle NoSQL database in key-value format, so the tweet itself is stored in encoded form. I do this in Java:

Value myValue = Value.createValue(msg.value().getBytes("UTF-8"));
Finally, the tweet is retrieved by a formatter written in Java. To store it in a relational schema I have to parse the tweet, so I retrieve it as a String:

String data = new String(value.toByteArray(),StandardCharsets.UTF_8);
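As you can see, I keep the UTF-8 encoding at every step. However, when I look at the tweet's text in the database, it is always cut off. For example:

RT @briIIohead: the hardest pill i had to swallow this year was learning that no matter how good you could be to somebody, no matter how mu?

Note how it ends with a "?" symbol and has clearly been cut. This happens with every long tweet: a text that is 30 characters long is displayed fine, but anything over 100 characters is cut.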

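At first I thought the cause might be my table definition, but the field "Text" is declared as VARCHAR2(400 CHAR), which is more than enough for the maximum number of characters a tweet can contain on the social network.

Any ideas on how I can find out what is cutting the text and putting the "?" symbol at the end?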

What "data" looks like:

{"created_at":"Tue May 28 09:23:36 +0000 2019","id":1133302792129351681,"id_str":"1133302792129351681","text":"RT @AppleEDU: Learn, create, and do more with iPad in your classroom. Get the free Everyone Can Create curriculum and bring projects to lif\u2026","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":1060510851889750022,"id_str":"1060510851889750022","name":"Rem.0112","screen_name":"0112Rem","location":"Mawson Lakes, Adelaide","url":null,"description":null,"translator_type":"none","protected":false,"verified":false,"followers_count":739,"friends_count":1853,"listed_count":10,"favourites_count":33406,"statuses_count":36936,"created_at":"Thu Nov 08 12:34:25 +0000 2018","utc_offset":null,"time_zone":null,"geo_enabled":true,"lang":null,"contributors_enabled":false,"is_translator":false,"profile_background_color":"F5F8FA","profile_background_image_url":"","profile_background_image_url_https":"","profile_background_tile":false,"profile_link_color":"1DA1F2","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/1093157842163355649\/6oAdJTCs_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/1093157842163355649\/6oAdJTCs_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/1060510851889750022\/1546155144","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweeted_status":{"created_at":"Thu May 23 15:15:16 +0000 2019","id":1131579354964725760,"id_str":"1131579354964725760","text":"Learn, 
create, and do more with iPad in your classroom. Get the free Everyone Can Create curriculum and bring proje\u2026 https:\/\/t.co\/aeeSPTXtFx","source":"\u003ca href=\"https:\/\/ads-api.twitter.com\" rel=\"nofollow\"\u003eTwitter Ads Composer\u003c\/a\u003e","truncated":true,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":468741166,"id_str":"468741166","name":"Apple Education","screen_name":"AppleEDU","location":"Cupertino, CA","url":null,"description":"Spark new ideas, create more aha moments, and teach in ways you\u2019ve always imagined. Follow @AppleEDU for tips, updates, and inspiration.","translator_type":"none","protected":false,"verified":true,"followers_count":728781,"friends_count":273,"listed_count":2594,"favourites_count":13189,"statuses_count":2766,"created_at":"Thu Jan 19 21:26:14 +0000 2012","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":null,"contributors_enabled":false,"is_translator":false,"profile_background_color":"F0F0F0","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"0088CC","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":false,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/892429342046691328\/2SOlm_09_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/892429342046691328\/2SOlm_09_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/468741166\/1530123538","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"extended_tweet":
{"full_text":"Learn, create, and do more with iPad in your classroom. Get the free Everyone Can Create curriculum and bring projects to life through music, drawing, video and photography.","display_text_range":[0,173],"entities":{"hashtags":[],"urls":[],"user_mentions":[],"symbols":[]}},"quote_count":0,"reply_count":3,"retweet_count":3,"favorite_count":58,"entities":{"hashtags":[],"urls":[{"url":"https:\/\/t.co\/aeeSPTXtFx","expanded_url":"https:\/\/twitter.com\/i\/web\/status\/1131579354964725760","display_url":"twitter.com\/i\/web\/status\/1\u2026","indices":[117,140]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"scopes":{"followers":false},"filter_level":"low","lang":"en"},"is_quote_status":false,"quote_count":0,"reply_count":0,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[],"user_mentions":[{"screen_name":"AppleEDU","name":"Apple Education","id":468741166,"id_str":"468741166","indices":[3,12]}],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"en","timestamp_ms":"1559035416048"}
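I should also mention that this whole payload is encoded, then decoded, and finally parsed into the database. All fields are decoded and parsed correctly except "text", which is cut.

According to the documentation, a tweet contains no more than 140 "characters" (a loose definition), but recently the limit was raised to 280.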
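The sample payload in the question shows how this change surfaces in the streamed data: the top-level "text" ends in \u2026 (an ellipsis), while the complete text sits under "extended_tweet" / "full_text" of the retweeted status. The following is only a minimal sketch of reading the longer field when it is present. The helper name full_tweet_text and the trimmed-down payload are hypothetical; the field names are taken from the sample above.

```python
import json


def full_tweet_text(tweet: dict) -> str:
    """Return the longest text available in a streamed tweet payload."""
    # For a retweet, the complete text lives on the original (retweeted) status.
    status = tweet.get("retweeted_status", tweet)
    # Long tweets carry their complete text under "extended_tweet".
    extended = status.get("extended_tweet")
    if extended and "full_text" in extended:
        return extended["full_text"]
    # Short tweets only have the plain "text" field.
    return status.get("text", "")


# Hypothetical payload with the same shape as the sample in the question.
raw = json.dumps({
    "text": "RT @AppleEDU: Learn, create, and do more with iPad\u2026",
    "retweeted_status": {
        "text": "Learn, create, and do more with iPad\u2026 https://t.co/aeeSPTXtFx",
        "truncated": True,
        "extended_tweet": {
            "full_text": "Learn, create, and do more with iPad in your classroom."
        },
    },
})

tweet = json.loads(raw)
print(full_tweet_text(tweet))
```

Note that this sketch returns the text of the original tweet, so the "RT @user:" prefix of the retweet is not included; whether that is acceptable depends on what the relational schema is meant to hold.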

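The same documentation says:

Twitter counts the length of a Tweet using the Normalization Form C (NFC) version of the text.

So they normalize the text first (I will let you figure out how Java does that). Later they say:

Twitter also counts the number of code points in the text rather than UTF-8 bytes.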
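To make the difference between code points, NFC-normalized code points, and UTF-8 bytes concrete, here is a small sketch in Python (the language of the producer in the question). The sample string is made up for illustration; it only uses the standard unicodedata module.

```python
import unicodedata

# "e" followed by a combining acute accent, then a space and an emoji.
text = "cafe\u0301 \U0001F600"

# NFC composes "e" + combining accent into the single code point "é".
nfc = unicodedata.normalize("NFC", text)

print(len(text))                 # 7  code points before normalization
print(len(nfc))                  # 6  code points after NFC normalization
print(len(nfc.encode("utf-8")))  # 10 bytes in UTF-8
```

The same text therefore has three different "lengths", which is why a limit expressed in code points cannot be enforced safely by counting or cutting bytes.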

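So:

String test = "RT @briIIohead: the hardest pill i had to swallow this year was learning that no matter how good you could be to somebody, no matter how mu";
System.out.println(test.codePoints().count()); // 139

It looks like the original tweet was 280 "characters" long, while the library you are using is not aware of this and still applies the previous limit of 140. Because that involves some chunking, the chunking appears to be wrong as well: it drops some "partial" bytes at the end. When you try to print this data, Java does not know what those last bytes mean (because of the incorrect chunking) and simply shows ? (its default strategy for displaying content it cannot interpret).

What does data look like? What type is it? Can you debug whether the whole text is available? For example, are you cutting the text yourself, or did you not receive the full content in the first place?

@Lino Sure, I will edit the question.

I remember Twitter raising its limit to 280 characters. Could it be that the library version you are using has not adjusted its internal limit, so that it still expects a maximum of 140 characters?

See the Tweepy GitHub page. I am not completely sure whether it has been fixed; it does not look like it (all the issues are either open or closed without changes).

@Lino Cool! I thought the library was broken, and you confirmed it! Thanks.

You're welcome. Please add that link to your answer to strengthen your claim :)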
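The mechanism the answer describes, a cut that lands in the middle of a multi-byte character, can be illustrated with a short Python sketch. The string and the cut position are made up for illustration; the sketch only shows what a decoder does with an incomplete UTF-8 sequence, not where the cut actually happens in the pipeline from the question.

```python
# The text ends with U+2026 (horizontal ellipsis), which takes three bytes in UTF-8.
text = "no matter how mu\u2026"
encoded = text.encode("utf-8")

# Cut one byte too early, as a limit applied to bytes rather than code points might.
cut = encoded[:-1]

# Decoding the incomplete sequence substitutes U+FFFD, the replacement character.
decoded = cut.decode("utf-8", errors="replace")
print(repr(decoded))
```

Java's new String(bytes, StandardCharsets.UTF_8) is documented to replace malformed input in the same way, so a damaged trailing character does not raise an error there either; it shows up as a substitute symbol at the end of the text.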