http://www.laurentluce.com/posts/python-string-objects-implementation/

Python string objects implementation

June 19, 2011

This article describes how string objects are managed by Python internally and how string search is done.

PyStringObject structure
New string object
Sharing string objects
String search

PyStringObject structure

A string object in Python is represented internally by the structure PyStringObject. “ob_shash” is the hash of the string, if it has been calculated. “ob_sval” contains the string of size “ob_size”. The string is null terminated. The initial size of “ob_sval” is 1 byte and ob_sval[0] = 0. If you are wondering where “ob_size” is defined, take a look at PyObject_VAR_HEAD in object.h. “ob_sstate” indicates whether the string object is in the interned dictionary, which we are going to see later.

typedef struct {
    PyObject_VAR_HEAD
    long ob_shash;
    int ob_sstate;
    char ob_sval[1];
} PyStringObject;

New string object

What happens when you assign a new string to a variable like this one?

>>> s1 = 'abc'

The internal C function “PyString_FromString” is called and the pseudo code looks like this:

arguments: C string: 'abc'
returns: Python string object with ob_sval = 'abc'
PyString_FromString(string):
    size = length of string
    allocate string object + size for 'abc'. ob_sval will be of size: size + 1
    copy string to ob_sval
    return object

Each time a new string is used, a new string object is allocated.
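To make the pseudo code a bit more concrete, here is a minimal Python model of it. This is only a sketch: the class StringObjectModel and the function string_from_bytes are hypothetical names, and the initial values -1 for “ob_shash” (hash not calculated yet) and 0 for “ob_sstate” (not interned) are assumptions of the sketch, not something shown in the pseudo code.

```python
from dataclasses import dataclass


@dataclass
class StringObjectModel:
    """Toy stand-in for PyStringObject; not the real C memory layout."""
    ob_size: int        # number of bytes, terminator excluded
    ob_shash: int       # cached hash; -1 means "not calculated" (assumption)
    ob_sstate: int      # 0 means "not interned" (assumption)
    ob_sval: bytearray  # payload, always ob_size + 1 bytes long


def string_from_bytes(data: bytes) -> StringObjectModel:
    size = len(data)
    buf = bytearray(size + 1)  # zero-filled, so the last byte is the terminator
    buf[:size] = data          # copy the characters in front of the terminator
    return StringObjectModel(size, -1, 0, buf)


s1 = string_from_bytes(b"abc")
print(s1.ob_size, bytes(s1.ob_sval))  # 3 b'abc\x00'
```

Calling string_from_bytes twice with the same bytes returns two distinct objects, which mirrors the statement that each new string gets its own allocation.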

Sharing string objects

There is a neat feature where small strings are shared between variables. This reduces the amount of memory used. Small strings are strings of size 0 or 1 byte. The global variable “interned” is a dictionary referencing those small strings. The array “characters” is also used to reference the strings of length 1 byte: i.e. single characters. We will see later how the array “characters” is used.

static PyStringObject *characters[UCHAR_MAX + 1];
static PyObject *interned;

Let’s see what happens when a new small string is assigned to a variable in your Python script.

>>> s2 = 'a'

The string object containing ‘a’ is added to the dictionary “interned”. The key is a pointer to the string object and the value is the same pointer. This new string object is also referenced in the array “characters” at offset 97, because the value of ‘a’ is 97 in ASCII. The variable “s2” points to this string object.

What happens when a different variable is assigned to the same string ‘a’?

>>> s3 = 'a'

The string object created previously is returned, so both variables point to the same string object. The “characters” array is used during that process to check whether the string already exists; if it does, the pointer to the existing string object is returned.

if (size == 1 && (op = characters[*str & UCHAR_MAX]) != NULL)
{
    ...
    return (PyObject *)op;
}
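The sharing mechanism can be modeled in a few lines of Python. This is a sketch under assumptions, not the C implementation: the class Str and the function from_string are hypothetical, the dictionary is keyed by object identity to mimic "the key is a pointer", and the empty string case is left out.

```python
UCHAR_MAX = 255

characters = [None] * (UCHAR_MAX + 1)  # one slot per possible byte value
interned = {}                          # shared string -> same shared string


class Str:
    """Toy string object; what matters here is identity, not content."""

    def __init__(self, data: bytes):
        self.data = data


def from_string(data: bytes) -> Str:
    # Fast path: a 1-byte string that already exists is simply reused.
    if len(data) == 1 and characters[data[0] & UCHAR_MAX] is not None:
        return characters[data[0] & UCHAR_MAX]
    op = Str(data)
    if len(data) == 1:
        interned[op] = op                     # key and value are the same object
        characters[data[0] & UCHAR_MAX] = op  # offset 97 for b"a"
    return op


s2 = from_string(b"a")
s3 = from_string(b"a")
print(s2 is s3)             # True: both names share one object
print(characters[97] is s2)  # True
```

Longer strings never touch the cache in this model, so two calls with b"abc" give two different objects.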

Let’s create a new small string containing the character ‘c’.

>>> s4 = 'c'

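We end up with two shared string objects: one for ‘a’, referenced by “s2”, “s3” and characters[97], and one for ‘c’, referenced by “s4” and characters[99]. Both objects are also in the “interned” dictionary.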

We also find the “characters” array in use when a string’s item is requested, as in the following Python script:

>>> s5 = 'abc'
>>> s5[0]
'a'

Instead of creating a new string containing ‘a’, the pointer at the offset 97 of the “characters” array is returned. Here is the code of the function “string_item” which is called when we request a character from a string. The argument “a” is the string object containing ‘abc’ and the argument “i” is the index requested: 0 in our case. A pointer to a string object is returned.

static PyObject *
string_item(PyStringObject *a, register Py_ssize_t i)
{
    char pchar;
    PyObject *v;
    ...
    pchar = a->ob_sval[i];
    v = (PyObject *)characters[pchar & UCHAR_MAX];
    if (v == NULL)
        // allocate string
    else {
        ...
        Py_INCREF(v);
    }
    return v;
}
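The same lookup can be sketched in Python. The function below is a hypothetical model of “string_item”, not a translation of the elided parts: it uses bytearray objects as stand-ins for 1-byte string objects, and it assumes that a freshly allocated 1-byte string is stored in the cache so that later requests reuse it.

```python
UCHAR_MAX = 255
characters = [None] * (UCHAR_MAX + 1)  # cache of 1-byte string objects


def string_item(a: bytes, i: int) -> bytearray:
    pchar = a[i]                        # byte value at the requested index
    v = characters[pchar & UCHAR_MAX]
    if v is None:
        v = bytearray([pchar])          # allocate the 1-byte string once
        characters[pchar & UCHAR_MAX] = v  # assumption: cache it for reuse
    return v


first = string_item(b"abc", 0)
again = string_item(b"banana", 1)
print(first is again)  # True: both requests return the cached b'a' object
```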

The “characters” array is also used for function names of length 1:

>>> def a(): pass

String search

Let’s take a look at what happens when you perform a string search like in the following Python code:

>>> s = 'adcabcdbdabcabd'
>>> s.find('abcab')
9

The “find” function returns the index where the string ‘abcab’ is first found in the string “s”. It returns -1 if the string is not found.

So, what happens internally? The function “fastsearch” is called. It is a mix between the Boyer-Moore and Horspool algorithms, plus a couple of neat tricks.

Let’s call “s” the string to search in and “p” the string to search for. s = ‘adcabcdbdabcabd’ and p = ‘abcab’. “n” is the length of “s” and “m” is the length of “p”. n = 15 and m = 5.

The first check in the code is obvious: if m > n, we know that “p” cannot be found in “s”, so the function returns -1 right away, as we can see in the following code:

w = n - m;
if (w < 0)
    return -1;

When m = 1, the code goes through “s” one character at a time and returns the index when there is a match. mode = FAST_SEARCH in our case, as we are looking for the index where the string is first found and not the number of times the string is found.

if (m <= 1) {
    ...
    if (mode == FAST_COUNT) {
        ...
    } else {
        for (i = 0; i < n; i++)
            if (s[i] == p[0])
                return i;
    }
    return -1;
}

For the other cases, i.e. m > 1, the first step is to create a compressed Boyer-Moore delta 1 table. Two variables are assigned during that step: “mask” and “skip”.

“mask” is a 32-bit bitmask, using the 5 least significant bits of the character as the key. It is generated from the string to search for, “p”. It acts as a simple bloom filter that tests whether a character is present in this string. It is really fast, but there can be false positives. This is how the bitmask is generated in our case:

mlast = m - 1;
/* process pattern[:-1] */
for (mask = i = 0; i < mlast; i++) {
    mask |= (1 << (p[i] & 0x1F));
}
/* process pattern[-1] outside the loop */
mask |= (1 << (p[mlast] & 0x1F));

The first character of “p” is ‘a’. The value of ‘a’ is 97 = 1100001 in binary. Using the 5 least significant bits, we get 00001, so “mask” is first set to 1 << 1 = 10 in binary. Once the entire string "p" is processed, mask = 1110. How do we use this bitmask? By using the following test, where "c" is the character to look for in the string "p".

if ((mask & (1 << (c & 0x1F))))
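The mask construction and the membership test can be tried out directly in Python. The helper names bloom_add and bloom_test are hypothetical; the bit arithmetic is the same as in the C snippets above. The last call shows a false positive: 'A' (65) and 'a' (97) share the same 5 least significant bits.

```python
def bloom_add(mask: int, ch: str) -> int:
    """Set the bit selected by the 5 low bits of the character code."""
    return mask | (1 << (ord(ch) & 0x1F))


def bloom_test(mask: int, ch: str) -> bool:
    """True if ch may be in the pattern, False if it is definitely not."""
    return bool(mask & (1 << (ord(ch) & 0x1F)))


mask = 0
for ch in "abcab":
    mask = bloom_add(mask, ch)

print(bin(mask))              # 0b1110
print(bloom_test(mask, "a"))  # True
print(bloom_test(mask, "d"))  # False
print(bloom_test(mask, "A"))  # True, although 'A' is not in 'abcab'
```

A false positive only costs a missed opportunity to skip ahead; a negative answer is always reliable, which is what the search loop depends on.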

Is ‘a’ in “p” where p = ‘abcab’? Is 1110 & (1 << ('a' & 0x1F)) true? 1110 & (1 << ('a' & 0x1F)) = 1110 & 10 = 10, which is non-zero, so yes, 'a' is in 'abcab'. If we test with 'd', we get false, and the same goes for the characters from 'e' to 'z', so this filter works pretty well in our case.

"skip" is derived from the position where the last character of "p" occurs again earlier in "p". If the last character does not occur anywhere else in "p", the search window ends up moving by the length of "p" - 1. The last character in the string to search for is 'b', which also appears at index 1, so "skip" is set to 2: this character can also be found by skipping over 2 characters down. This variable is used in a skip method called the bad-character skip method.

In the following example, p = 'abcab' and s = 'adcabcaba'. The search starts by comparing the last character of "p" with index 4 of "s", which matches, and then checks the rest of the window. This test fails at index 1, where 'b' is different from 'd'. We know that the character 'b' in "p" is also found 3 characters down starting from the end. Because 'c', the character that follows the window, is part of "p", we cannot skip the whole length of "p"; instead we move the window 3 characters forward, so that the earlier 'b' of "p" lines up with the 'b' at index 4 of "s". This is the bad-character skip.
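Here is a small sketch of how "skip" can be computed. The function name compute_skip is hypothetical, and the default value mlast - 1 is an assumption of this sketch: the stored value is one less than the distance the window moves, because the search loop adds 1 on its own at every iteration.

```python
def compute_skip(p: str) -> int:
    """Number of characters to jump, on top of the loop's own +1 step."""
    mlast = len(p) - 1
    skip = mlast - 1              # default: last character occurs nowhere else
    for i in range(mlast):
        if p[i] == p[mlast]:
            skip = mlast - i - 1  # keep the right-most earlier occurrence
    return skip


print(compute_skip("abcab"))  # 2: the other 'b' sits 3 characters before the end
print(compute_skip("abcde"))  # 3: 'e' is unique, the window moves by m - 1 = 4
```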

Next is the search loop itself (real code is in C instead of Python):

for i = 0 to n - m:
    if s[i+m-1] == p[m-1]:
        if s[i:i+mlast] == p[0:mlast]:
            return i
        if s[i+m] not in p:
            i += m
        else:
            i += skip
    else:
        if s[i+m] not in p:
            i += m
return -1

The test “s[i+m] not in p” is done using the bitmask. “i += skip” is the bad-character skip. “i += m” is done when the next character is not found in “p”.

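Let’s see how this search algorithm works with our strings “p” and “s”. At i = 0, the last character of the window matches ('b') but the rest of the window does not. The character that follows the window, 'c', is in “p”, so the bad-character skip moves the window to i = 3. There, 'b' matches again, but 'abcd' is different from 'abca'. This time the character that follows the window is ‘d’, which is not in the string “p”, so we skip the length of “p” and quickly find a match after that, at i = 9.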
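To follow the walk-through step by step, the whole search can be written as runnable Python. This is a sketch that follows the description in this article, not the C source: the name fastsearch is reused for convenience, the bounds check on s[i + m] replaces the null terminator that the C code can rely on, and only the "find first index" mode is covered.

```python
def fastsearch(s: str, p: str) -> int:
    """Return the first index of p in s, or -1 (find mode only)."""
    n, m = len(s), len(p)
    w = n - m
    if w < 0:
        return -1
    if m == 0:
        return 0                      # same answer as str.find
    if m == 1:
        for i in range(n):            # plain scan for a single character
            if s[i] == p:
                return i
        return -1

    mlast = m - 1
    skip = mlast - 1                  # assumption: default when p[-1] is unique
    mask = 0
    for i in range(mlast):            # process pattern[:-1]
        mask |= 1 << (ord(p[i]) & 0x1F)
        if p[i] == p[mlast]:
            skip = mlast - i - 1
    mask |= 1 << (ord(p[mlast]) & 0x1F)  # process pattern[-1]

    def maybe_in_p(ch: str) -> bool:
        return bool(mask & (1 << (ord(ch) & 0x1F)))

    i = 0
    while i <= w:
        if s[i + mlast] == p[mlast]:
            if s[i:i + mlast] == p[:mlast]:
                return i
            if i + m < n and not maybe_in_p(s[i + m]):
                i += m                # next character cannot be in p
            else:
                i += skip             # bad-character skip
        elif i + m < n and not maybe_in_p(s[i + m]):
            i += m
        i += 1                        # the loop's own step
    return -1


print(fastsearch("adcabcdbdabcabd", "abcab"))  # 9
print(fastsearch("adcabcaba", "abcab"))        # 3
```

With our strings, the window positions visited are i = 0, 3 and 9, which matches the steps described above.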
