In Python, how do I extract specific characters from a string?


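I need to extract the date and location from this string. Is there a more efficient way to do it? One that is also less error-prone: for example, the word before the time may not always be "from".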

text = ('Join us for a guided tour of the Campus given by the '
        'Admissions staff. The tour will take place from 3:15-4:00 PM EST '
        'and leaves from the Admissions Office in x House. No registration required.')

length = len(text)

# Look for the literal word "from" and print the 17 characters starting there.
# Stopping at length - 3 keeps text[x+3] inside the string.
for x in range(length - 3):
    if text[x] == 'f':
        if text[x+1] == 'r':
            if text[x+2] == 'o':
                if text[x+3] == 'm':
                    print(text[x:(x+17)])
                    break

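Output: from 3:15-4:00 PM

To extract the start time from the time range, use a regular expression:

(?i)\b(\d{1,2}:\d{2})(?:-\d{1,2}:\d{2})?(\s*[pa]m)\b

Details

  • (?i) - turns case-insensitive matching on
  • \b - a leading word boundary
  • (\d{1,2}:\d{2}) - Group 1, capturing 1 or 2 digits, a colon and 2 digits
  • (?:-\d{1,2}:\d{2})? - an optional non-capturing group matching 1 or 0 occurrences of:
    • - - a hyphen
    • \d{1,2} - 1 or 2 digits
    • : - a colon
    • \d{2} - 2 digits
  • (\s*[pa]m) - Group 2, capturing the following sequence:
    • \s* - 0 or more whitespace characters
    • [pa] - p or a (or P or A, since matching is case-insensitive)
    • m - m (or M)
  • \b - a trailing word boundary

Since the result comes in two separate groups, we need to iterate over all matches and concatenate the values of the two groups.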
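The iteration over matches can be sketched as follows. This is a minimal example, not the answer's original demo: the helper name extract_start_times is an assumption made for illustration, and the sample text is the one from the question.

```python
import re

# The pattern from the answer: group 1 is the start time, group 2 is the AM/PM part.
TIME_RE = re.compile(r"(?i)\b(\d{1,2}:\d{2})(?:-\d{1,2}:\d{2})?(\s*[pa]m)\b")


def extract_start_times(text):
    """Return every start time in text, joined with its AM/PM marker.

    Hypothetical helper name; it only wraps re.finditer and string concatenation.
    """
    return [m.group(1) + m.group(2) for m in TIME_RE.finditer(text)]


text = ('Join us for a guided tour of the Campus given by the '
        'Admissions staff. The tour will take place from 3:15-4:00 PM EST '
        'and leaves from the Admissions Office in x House. No registration required.')

print(extract_start_times(text))  # ['3:15 PM']
```

Because the end time sits in a non-capturing group, it is matched but never returned, so only the start time and the period are concatenated.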

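You can use the following regular expression:

r"from [^A-Za-z]+(?:AM|PM)"

It looks in the text for a position that starts with "from" and is followed by no letters (other than AM or PM). On the text you provided, it returns

from 3:15-4:00 PM

You can use it like this:

import re
print(re.search(r"from [^A-Za-z]+(?:AM|PM)", text).group())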
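One caveat with this kind of search is that re.search returns None when nothing matches, so calling a method on the result directly can raise an AttributeError. A small sketch of a guarded variant follows; the function name find_time_range is an assumption, and the added capture group (to drop the leading "from ") is a variation on the pattern, not part of the original answer.

```python
import re


def find_time_range(text):
    """Return the time range that follows 'from ', or None if there is none.

    Hypothetical helper: the parentheses capture everything after 'from '.
    """
    match = re.search(r"from ([^A-Za-z]+(?:AM|PM))", text)
    return match.group(1) if match else None


text = ('Join us for a guided tour of the Campus given by the '
        'Admissions staff. The tour will take place from 3:15-4:00 PM EST '
        'and leaves from the Admissions Office in x House. No registration required.')

print(find_time_range(text))            # 3:15-4:00 PM
print(find_time_range('no time here'))  # None
```

Note that this pattern still depends on the word "from" preceding the time, which is exactly the fragility the question mentions.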

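You are not limited to regular expressions when parsing string content.

Instead of regular expressions, you can use the parsing technique described below. It is similar to the technique used in compilers.

Simple example of the technique

First, have a look at this example. It only finds the times in the text.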

TEXT = 'Join us for a guided tour of the Campus given by the Admissions staff. The tour will take place ' \
       'from 3:15-4:00 PM EST and leaves from the Admissions Office in AA A AAA House. No registration required.\n' \
       'The tour will take place from 7:30 AM UTC and leaves from the Admissions Office in B BBB House.\n' \
       'The tour will take place 17:30 UTC and leaves from the Admissions Office in C CCC C House.\n' \
       'The tour will take place 9:30-11:00 AM and leaves from the Admissions Office in DDD House.\n' \
       'The tour will take place 15:00-16:25 and leaves from the Admissions Office in EE EE House.\n' \
       'No registration required. '

TIME_SEPARATORS = ':-'

time_text_start = None
time_text_end = None
time_text = ''

index = 0
for char in TEXT:
    if time_text_start is None:
        if char.isdigit():
            time_text_start = index
    if (time_text_start is not None) and (time_text_end is None):
        if (not char.isdigit()) and (not char.isspace()) and (char not in TIME_SEPARATORS):
            time_text_end = index
            time_text = TEXT[time_text_start: time_text_end].strip()

            print(time_text)

            # Now we will clear our variables to be able to find next time_text data in the text
            time_text_start = None
            time_text_end = None
            time_text = ''
    index += 1

This code prints the following:

3:15-4:00
7:30
17:30
9:30-11:00
15:00-16:25
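The same single-pass scan can also be wrapped in a function that returns the results instead of printing them. This is a sketch under assumptions: the name find_time_texts is made up for illustration, and the final flush (for a time sitting at the very end of the text) is an addition that the example above does not have.

```python
def find_time_texts(text, separators=':-'):
    """Collect runs made only of digits, whitespace and separator characters.

    A run starts at the first digit and ends at the first character that is
    not a digit, whitespace or separator.
    """
    results = []
    start = None
    for index, char in enumerate(text):
        if start is None:
            if char.isdigit():
                start = index
        elif not (char.isdigit() or char.isspace() or char in separators):
            results.append(text[start:index].strip())
            start = None
    # Added in this sketch: keep a run that reaches the end of the text.
    if start is not None:
        results.append(text[start:].strip())
    return results


sample = 'Tour from 3:15-4:00 PM EST, then 17:30 UTC, ends 9:30'
print(find_time_texts(sample))  # ['3:15-4:00', '17:30', '9:30']
```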

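Real code

Now you can look at the real code. It finds all the data you need: the time, the period, the time standard and the location.

The location in the text must come after the time and lie between the words "in" and "house".

To add other search conditions, you can modify the def find(self, text_to_process) method of the EventsDataFinder class.

To change the format (for example, to return the full time including the end time), you can modify the def _prepare_event_data(time_text, time_period, time_standard, event_place) method of the EventsDataFinder class.

P.S.: I know that classes can be hard for beginners to understand, so I tried to keep this code as simple as possible. But without classes the code would be hard to follow, so there are a few of them.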

class TextUnit:
    text = ''
    start = None
    end = None
    absent = False

    def fill_from_text(self, text):
        self.text = text[self.start: self.end].strip()

    def clear(self):
        self.text = ''
        self.start = None
        self.end = None
        self.absent = False


class EventsDataFinder:
    time_standards = {
        'est',
        'utc',
        'dst',
        'edt'
    }
    time_standard_text_len = 3

    period = {
        'am',
        'pm'
    }
    period_text_len = 2

    time_separators = ':-'

    event_place_start_indicator = ' in '
    event_place_end_indicator = ' house'

    fake_text_end = '.'

    def find(self, text_to_process):
        '''
        This method will parse given text and will return list of tuples. Each tuple will contain time of the event
        in the desired format and location of the event.
        :param text_to_process: text to parse
        :return: list of tuples. For example [('3:15 PM EST', 'AA A AAA'), ('7:30 AM UTC', 'B BBB')]
        '''
        text = text_to_process.replace('\n', '')
        text += self.fake_text_end

        time_text = TextUnit()
        time_period = TextUnit()
        time_standard = TextUnit()
        event_place = TextUnit()

        result_events = list()

        index = -1
        for char in text:
            index += 1

            # Time text
            if time_text.start is None:
                if char.isdigit():
                    time_text.start = index
            if (time_text.start is not None) and (time_text.end is None):
                if (not char.isdigit()) and (not char.isspace()) and (char not in self.time_separators):
                    time_text.end = index
                    time_text.fill_from_text(text)

            # Time period
            # If time_text is already found:
            if (time_text.end is not None) and \
                    (time_period.end is None) and (not time_period.absent) and \
                    (not char.isspace()):
                potential_period = text[index: index + self.period_text_len].lower()
                if potential_period in self.period:
                    time_period.start = index
                    time_period.end = index + self.period_text_len
                    time_period.fill_from_text(text)
                else:
                    time_period.absent = True

            # Time standard
            # If time_period is already found or does not exist:
            if (time_period.absent or ((time_period.end is not None) and (index >= time_period.end))) and \
                    (time_standard.end is None) and (not time_standard.absent) and \
                    (not char.isspace()):
                potential_standard = text[index: index + self.time_standard_text_len].lower()
                if potential_standard in self.time_standards:
                    time_standard.start = index
                    time_standard.end = index + self.time_standard_text_len
                    time_standard.fill_from_text(text)
                else:
                    time_standard.absent = True

            # Event place
            # If time_standard is already found or does not exist:
            if (time_standard.absent or ((time_standard.end is not None) and (index >= time_standard.end))) and \
                    (event_place.end is None) and (not event_place.absent):
                if self.event_place_end_indicator.startswith(char.lower()):
                    potential_event_place = text[index: index + len(self.event_place_end_indicator)].lower()
                    if potential_event_place == self.event_place_end_indicator:
                        event_place.end = index
                        potential_event_place_start = text.rfind(self.event_place_start_indicator,
                                                                 time_text.end,
                                                                 event_place.end)
                        if potential_event_place_start > 0:
                            event_place.start = potential_event_place_start + len(self.event_place_start_indicator)
                            event_place.fill_from_text(text)
                        else:
                            event_place.absent = True

            # Saving result and clearing temporary data holders
            # If event_place is already found or does not exist:
            if event_place.absent or (event_place.end is not None):
                result_events.append(self._prepare_event_data(time_text,
                                                              time_period,
                                                              time_standard,
                                                              event_place))
                time_text.clear()
                time_period.clear()
                time_standard.clear()
                event_place.clear()

        # This code will save data of the last incomplete event (all that was found). If it exists of course.
        if (time_text.end is not None) and (event_place.end is None):
            result_events.append(self._prepare_event_data(time_text,
                                                          time_period,
                                                          time_standard,
                                                          event_place))

        return result_events

    @staticmethod
    def _prepare_event_data(time_text, time_period, time_standard, event_place):
        '''
        This method will prepare found data to be saved in a desired format
        :param time_text: text of time
        :param time_period: text of period
        :param time_standard: text of time standard
        :param event_place: location of the event
        :return: will return ready to save tuple. For example ('3:15 PM EST', 'AA A AAA')
        '''
        event_time = time_text.text  # '3:15-4:00'
        split_time = event_time.split('-')  # ['3:15', '4:00']
        if 1 < len(split_time):
            # If it was, for example, '3:15-4:00 PM EST' in the text
            start_time = split_time[0].strip()  # '3:15'
            end_time = split_time[1].strip()  # '4:00'
        else:
            # If it was, for example, '3:15 PM EST' in the text
            start_time = event_time  # '3:15'
            end_time = ''  # ''
        period = time_period.text.upper()  # 'PM'
        standard = time_standard.text.upper()  # 'EST'
        event_place = event_place.text  # 'AA A AAA'

        # Removing empty time fields (for example if there is no period or time standard in the text)
        time_data_separated = [start_time, period, standard]
        new_time_data_separated = list()
        for item in time_data_separated:
            if item:
                new_time_data_separated.append(item)
        time_data_separated = new_time_data_separated

        event_time_interval = ' '.join(time_data_separated)
        result = (event_time_interval, event_place)

        return result


TEXT = 'Join us for a guided tour of the Campus given by the Admissions staff. The tour will take place ' \
       'from 3:15-4:00 PM EST and leaves from the Admissions Office in AA A AAA House. No registration required.\n' \
       'The tour will take place from 7:30 AM UTC and leaves from the Admissions Office in B BBB House.\n' \
       'The tour will take place 17:30 UTC and leaves from the Admissions Office in C CCC C House.\n' \
       'The tour will take place 9:30-11:00 AM and leaves from the Admissions Office in DDD House.\n' \
       'The tour will take place 15:00-16:25 and leaves from the Admissions Office in EE EE House.\n' \
       'No registration required. '

edf = EventsDataFinder()

print(edf.find(TEXT))
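
As an illustration of the kind of format change mentioned for _prepare_event_data, here is a standalone sketch of a formatter that can optionally keep the end time. The function name format_event_time and its include_end parameter are assumptions made for this example; they are not part of the code above, which always drops the end time.

```python
def format_event_time(time_text, period='', standard='', include_end=False):
    """Build a string such as '3:15 PM EST' or '3:15-4:00 PM EST'.

    Hypothetical helper: time_text is the raw time ('3:15-4:00' or '3:15'),
    period is 'am'/'pm' or empty, standard is e.g. 'est' or empty.
    """
    start_time, _, end_time = time_text.partition('-')
    start_time, end_time = start_time.strip(), end_time.strip()
    if include_end and end_time:
        shown_time = f'{start_time}-{end_time}'
    else:
        shown_time = start_time
    fields = [shown_time, period.upper(), standard.upper()]
    # Skip empty fields, e.g. when the text has no period or time standard.
    return ' '.join(field for field in fields if field)


print(format_event_time('3:15-4:00', 'pm', 'est'))                    # 3:15 PM EST
print(format_event_time('3:15-4:00', 'pm', 'est', include_end=True))  # 3:15-4:00 PM EST
```

Keeping the formatting in one small function means the scanning logic in find does not need to change when the output format does.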

Use a regex. Yes, just compare the simplicity of the regex solution with the NNN lines of code suggested in one of the answers below.

Thanks!