Text scraping using Python: Regex

python regex

I have a dynamic text that looks like this:

my_text = "address ae fae daq ad, 1231 asdas  landline 213121233 -123    mobile 513121233 cell (132) -142-3127  
           email sdasdas@gmail.com , sdasd as@yahoo.com - ewqas@gmail.com"

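The text starts with 'address'. As soon as we see 'address', we need to scrape everything from there until 'landline', 'mobile' or 'cell' appears. From there on, we want to scrape all the phone text (without altering the spaces in between). We start from the first occurrence of 'landline', 'mobile' or 'cell' and stop as soon as 'email' appears. Finally, we scrape the email part (again without altering the spaces in between).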

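'landline', 'mobile' and 'cell' can appear in any order, and sometimes some of them may not appear at all. For example, the text could also look like this: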

my_text = "address ae fae daq ad, 1231 asdas  
           cell (132) -142-3127 landline 213121233 -123     
           email sdasdas@gmail.com , sdasd as@yahoo.com - ewqas@gmail.com"

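A little more engineering is needed to form arrays of the subtexts contained in the address, phone and email text. Subtexts of the address are always separated by commas (,). Subtexts of the emails can be separated by commas (,) or hyphens (-).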

My output should be a JSON dictionary that looks like this:

resultant_dict = {
                      addresses: [
                                  { address: "ae fae daq ad" }
                                , { address: "1231 asdas" }
                               ]
                    , phones: [
                                  { number: "213121233 -123", kind: "landline" }
                                , { number: "513121233", kind: "mobile" }
                                , { number: "(132) -142-3127", kind: "cell" }
                             ]
                    , emails: [
                                  { email: "sdasdas@gmail.com", connector: "" }
                                , { email: "sdasd as@yahoo.com", connector: "," }
                                , { email: "ewqas@gmail.com", connector: "-" }
                              ]
}

I am trying to achieve this with regular expressions or any other approach in Python. I can't figure out how to write it, as I am a novice programmer.

This is not a good job for a regex, because the components you want to parse from your input can appear in any order and in any number.

Consider using a lexing and parsing library, such as pyPEG, which is based on parsing expression grammars.

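Another approach is to use str.split() or re.split() to split the input text into tokens. Then scan through those tokens, looking for your keywords such as address, cell, and ',', accumulating the tokens that follow up to the next keyword. This approach lets split() do the first part of the tokenizing work, leaving the rest of the lexing work (recognizing keywords) and the parsing work for you to do by hand.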
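As a rough sketch of the re.split() variant of this idea, the snippet below splits on the keywords themselves rather than on whitespace, so the spacing inside each section is left untouched. The names KEYWORDS and split_sections are made up for this illustration, and the sketch assumes the keywords only ever appear as whole words.

```python
import re

# Hypothetical keyword list for this illustration.
KEYWORDS = ("address", "landline", "mobile", "cell", "email")

def split_sections(text):
    """Return (keyword, body) pairs in the order they appear in text."""
    # The capturing group makes re.split() keep the keywords in its result.
    pattern = r"\b(" + "|".join(KEYWORDS) + r")\b"
    parts = re.split(pattern, text)
    # parts[0] is whatever precedes the first keyword; after that the list
    # alternates keyword, body, keyword, body, ...
    return [(kw, body.strip()) for kw, body in zip(parts[1::2], parts[2::2])]

sample = ("address ae fae daq ad, 1231 asdas  cell (132) -142-3127 "
          "landline 213121233 -123     email sdasdas@gmail.com , "
          "sdasd as@yahoo.com - ewqas@gmail.com")

for keyword, body in split_sections(sample):
    print(keyword, repr(body))
```

Only the leading and trailing whitespace of each section is stripped; anything between the first and last non-space character is kept exactly as it was in the input.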

The by-hand approach is more instructive, but it is also more verbose and less flexible. It goes like this:

text = """address ae fae daq ad, 1231 asdas  
           cell (132) -142-3127 landline 213121233 -123     
           email sdasdas@gmail.com , sdasd as@yahoo.com - ewqas@gmail.com"""

class Scraper:
    def __init__(self):
        self.current = []
        self.current_type = None

    def emit(self):
        if self.current:
            # TODO: Add the new item to a dictionary.
            # Later, translate the dictionary to JSON format.
            print(self.current_type, self.current)

    def scrape(self, input_text):
        tokens = input_text.split()
        for token in tokens:
            if token in ('address', 'cell', 'landline', 'email'):
                self.emit()
                self.current = []
                self.current_type = token
            else:
                self.current.append(token)
        self.emit()

s = Scraper()
s.scrape(text)

This produces:

address ['ae', 'fae', 'daq', 'ad,', '1231', 'asdas']
cell ['(132)', '-142-3127']
landline ['213121233', '-123']
email ['sdasdas@gmail.com', ',', 'sdasd', 'as@yahoo.com', '-', 'ewqas@gmail.com']

You will need to use re.split() to split 'ad,' into ['ad', ','], add code to handle tokens like ',', and use a library to convert your dictionary to JSON format.
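Those two remaining steps might look roughly like the following. This is only a sketch: the result dictionary is filled in by hand for illustration, and the standard json module is just one possible choice of library.

```python
import json
import re

# Splitting on a captured comma keeps the comma as its own token.
# filter(None, ...) drops the empty string left after the trailing comma.
tokens = list(filter(None, re.split(r"(,)", "ad,")))
print(tokens)  # ['ad', ',']

# Hypothetical result dictionary, filled in by hand here for illustration.
result = {
    "addresses": [{"address": "ae fae daq ad"}, {"address": "1231 asdas"}],
    "phones": [{"number": "(132) -142-3127", "kind": "cell"}],
}

# json.dumps() turns the dictionary into a JSON-formatted string.
as_json = json.dumps(result, indent=2)
print(as_json)
```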

This will work as long as there is no whitespace in the emails:

import re

my_text = 'address ae fae daq ad, 1231 asdas  landline 213121233 -123    mobile 513121233 cell (132) -142-3127  email sdasdas@gmail.com , sdasdas@yahoo.com - ewqas@gmail.com'

split_words = ['address', 'landline', 'mobile', 'cell', 'email']
resultant_dict = {'addresses': [], 'phones': [], 'emails': []}

for sw in split_words:

    # Skip keywords that do not occur in this input.
    if sw not in my_text:
        continue

    # Take the text that follows the keyword.
    # filter() returns an iterator in Python 3, so turn it into a list.
    text = list(filter(None, my_text.split(sw)))
    text = text[0].strip() if len(text) < 2 else text[1].strip()

    # Cut the text off at the next keyword, if there is one.
    next_split = [x.strip() for x in text.split() if x.strip() in split_words]

    if next_split:
        text = text.split(next_split[0])[0].strip()

    if sw in ['address']:
        # Address parts are always separated by commas.
        text = text.split(',')
        for t in text:
            resultant_dict['addresses'].append({'address': t.strip()})

    elif sw in ['landline', 'mobile', 'cell']:
        resultant_dict['phones'].append({'number': text, 'kind': sw})

    elif sw in ['email']:

        # Emails are separated by either a comma or a hyphen.
        connectors = [',', '-']
        emails = re.split('|'.join(connectors), text)
        text = list(filter(None, [x.strip() for x in text.split()]))

        for email in emails:

            email = email.strip()
            connector = ''
            index = text.index(email) if email in text else 0

            # The connector is the token just before the email, if any.
            if index > 0:
                connector = text[index - 1]

            resultant_dict['emails'].append({'email': email, 'connector': connector})

print(resultant_dict)

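Thanks. Can you provide a working solution that uses something other than regex?

Note: the more complicated the program gets in order to handle cases like 'ad,', the better pyPEG will do. And the more complicated this answer gets, the less illuminating it will be for you and other readers. Also note how the input parsing code, scrape(), is kept separate from the output construction code, emit(). This modularity makes the program easier to understand, debug, and modify.

This works. But I want to preserve the spaces for a reason. I will try to edit the code accordingly and post an update. If you can make a quick adjustment to keep the spaces, feel free to add it as well :)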
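One possible way to keep the spaces inside each email is to split only on the connectors instead of on whitespace. The sketch below is an illustration under stated assumptions, not part of either answer: parse_emails is a made-up helper name, and a hyphen is treated as a connector only when it has whitespace on both sides, so hyphens inside an address are left alone.

```python
import re

def parse_emails(email_text):
    """Split email_text on ',' or a space-delimited '-', keeping connectors."""
    # Assumption: a hyphen is a connector only with whitespace on both sides.
    # The capturing group makes re.split() return the connectors as well.
    parts = re.split(r"\s*(,|(?<=\s)-(?=\s))\s*", email_text.strip())
    emails = []
    connector = ""  # the first entry has no preceding connector
    for i, part in enumerate(parts):
        if i % 2:
            # Odd positions hold the captured connectors.
            connector = part
        elif part:
            # Even positions hold the email text, inner spaces untouched.
            emails.append({"email": part, "connector": connector})
    return emails

print(parse_emails("sdasdas@gmail.com , sdasd as@yahoo.com - ewqas@gmail.com"))
```

Because nothing is split on whitespace, an entry such as 'sdasd as@yahoo.com' comes through with its inner space intact.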