UVa OJ 175 - Keywords (关键字)

Time limit: 3.000 seconds
限时3.000秒

Problem
问题

Many researchers are faced with an ever increasing number of journal articles to read and find it difficult to locate papers of relevance to their particular lines of research. However, it is possible to subscribe to various services which claim that they will find articles that fit an `interest profile' that you supply, and pass them on to you. One simple way of performing such a search is to determine whether a pair of keywords occurs `sufficiently' close to each other in the title of an article. The threshold is determined by the researchers themselves, and refers to the number of words that may occur between the pair of keywords. Thus an archeologist interested in cave paintings could specify her profile as ``0 rock art'', meaning that she wants all titles in which the words ``rock'' and ``art'' appear with 0 words in between, that is next to each other. This would select not only ``Rock Art of the Maori'' but also ``Pop Art, Rock, and the Art of Hang-glider Maintenance''.
许多研究人员都面临这样一个问题：阅读的期刊文章数量与日俱增，要找到与他们特定研究方向相关的文章困难重重。然而，有一些订阅服务声称它们可以按你制定的“兴趣配置”找到匹配的文章，并传送给你。一种简单的方式就是执行这样一种搜索：确定文章中是否有一对单词出现的“足够” 靠近。研究人员设定一个阈值，指出一对单词之间应出现的单词数量。例如一个考古学家对岩洞壁画感兴趣，就会指定她的兴趣配置为“0 rock

art”，意思是她希望标题中出现“rock”和“art”且间隔为0单词的所有文章，即这两个单词彼此相临。这样的兴趣配置会选出的标题包括“Rock
Art of the Maori”和“Pop Art, Rock, and the Art of Hang-glider
Maintenance”等。

Write a program that will read in a series of
profiles followed by a series of titles and determine which of the
titles (if any) are selected by each of the profiles. A title is
selected by a profile if at least one pair of keywords from the profile
is found in the title, separated by no more than the given threshold.
For the purposes of this program, a word is a sequence of letters,
preceded by one or more blanks and terminated by a blank or the end of
line marker.
写一个程序，读入一系列的配置文件，再读入一系列的标题，确定哪些标题（如果有）会被各配置选中。一个标题被一个配置选中仅当配置中的至少一对单词出现在标题中，并且间隔没有超过给定的阈值。对于这个程序而言，一个单词就是字母的序列，前面有一个或多个空白，并以空白或行结束符作为结束。

Input
输入

Input will consist of no more than 50 profiles followed by no more than 250 titles. Each profile and title will be numbered in the order of their appearance, starting from 1, although the numbers will not appear in the file.
输入的配置不会超过50个，标题不会超过250个。每一个配置和标题都以给出的顺序编号（从1开始计数），但编号并不会在输入中给出。

Each
profile will start with the characters ``P:'', and will consist of a
number representing a threshold, followed by two or more keywords in
lower case.
每个配置都以字符“P:”开始，包括一个表示阈值的数，接下来是两个或更多的关键字，均为小写形式。

Each
title will start with the characters ``T:'', and will consist of a
string of characters terminated by ``|''. The character ``|'' will not
occur anywhere in a title except at the end. No title will be longer
than 255 characters, and if necessary it will flow on to more than one
line. No line will be longer than eighty characters and each
continuation line of a title will start with at least one blank. Line
breaks will only occur between words.
每个标题都以字符“T:”开始，会包括一个以“|”作为结束的字符串。字符“|”不会出现在标题中除末尾外的任何位置。标题都不会超过255个字节，如果必要会分成多行给出。所有行的长度都不会超过80个字符，且标题的每个续行都以至少一个空白作为开始。换行只会出现在单词之间。

All non-alphabetic characters are to
be ignored, thus the title ``Don't Rock -- the Boat as Metaphor in
1984'' would be treated as ``Dont Rock the Boat as Metaphor in'' and
``HP2100X'' will be treated as ``HPX''. The file will be terminated by a
line consisting of a single #.
所有非字母的字符都应忽略，例如标题“Don't Rock -- the
Boat as Metaphor in 1984”应被当作“Dont Rock the Boat as Metaphor
in”处理，“HP2100X”将被当作“HPX”处理。输入文件以只有一个#的一行作为结束。

Output
输出

Output will consist of a series of lines, one for each profile in the input. Each line will consist of the profile number (the number of its appearance in the input) followed by ``:'', a blank space, and the numbers of the selected titles in numerical order, separated by commas and with no spaces.
输出由多行构成，每行对应输入的一个配置。每行都应包括配置的编号（配置在输入中的编号）跟着一个“:”，一个空格，然后是以数字顺序排列的选中标题的编号，中间以逗号隔开，不要空格。

Sample input
示例输入

Sample output
示例输出

Analysis
分析

这道题重点考察对搜索匹配问题的建模能力，实际和字符串处理关系不大。要注意以下几点：

所有非字母的字符都不处理；
仅以空格或换行作为单词的分隔符；
单词均以小写形式处理；
配置中的单词任两个都要算做一对。

前三条原则实际上就把单词给量化了，如果对单词编号建表，那么配置和标题就都成为了一堆数字（每个数字皆为单词的编号）

一、以正确的方式处理输入的配置，录入全部配置中的单词。遍历配置中的所有单词，建立从单词到编号的对应表（即单词表），此处可以使用stl中的map作为单词表的数据结构。接下来用单词表规格化处理所有的配置和标题，标题中不在单词表中的单词可用-1标记。

二、建立数据配置中的单词对数据。对于配置中的每一对单词，实际上是一个三元组：（单词对，阈值，所属配置编号）。由于在第一步已经将所有单词规格化为数字了，因此单词对就是两个整数。考虑单词的总数一定不会上万，且单词对中两个单词的顺序无所谓，因此可以用两个字节表示一个编号，然后将较小的编号放在高字节，较大的放低位构成一个4字节的整数，这个整数就可以唯一的表示一个单词对。那么所有配置中的单词对的数据就可以多种形式来表达了，这里使用map映射，key是单词对的整数，value是一个结构体的动态数组，结构体中包括阈值和所属配置编号。

三、建立标题中的单词对数据。标题中的单词包括非关键词（编号为-1）和关键词，要求出各标题中每一对存在的关键词的最短距离，并用一种数据结构表达。这里使用和第二步相似的数据结构，每一对关键词是一个三元组：（单词对，最短距离，所属标题编号）。查找标题中所有关键词对的最短距离用暴力搜索就可以了，遍历的顺序和冒泡排序一样，复杂度是n²。所有数值求出来后，建立map映射，key是单词对的整数，value是一个结构体的动态数组，结构体中包括最短距离和所属标题的编号。

四、最后就是比对配置的映射表和标题的映射表，找出相同的单词对，然后比对各自的数组。如果有最短距离小于或等于阈值的，那就在这个标题编号和配置编号建立一个联系。找出所有的配置编号-标题编号关系后，按配置编号排序，整理输出即可。

Solution
解答

#include <algorithm>

#include <iostream>

#include <string>

#include <vector>

#include <map>

#include <utility>

typedef unsigned long ulong;

typedef unsigned short ushort;

// 用于存储profile中的阈值和转成数字序列的关键词组合

struct PROFILE

{

	size_t nThreshold;

	std::vector<ushort> nArray;

};

// 用于存储profile中的阈值和profile的编号，title中的包含的两个关键字之间的距离和title的编号

struct INFO

{

	size_t nDist;

	size_t nIdx;

};

typedef std::vector<std::string> VECSTR;

typedef std::vector<ushort> ARRAY;

typedef std::vector<ARRAY> MATRIX;

typedef std::map<ulong, std::vector<INFO> > MAPINFO;

typedef std::pair<size_t, size_t> PAIR;

// 将keywords对中的两个单词用数字序列表示，用一个unsigned short数据类型存储

ulong MakeWordPair(ushort w1, ushort w2)

{

	return (w1 > w2)? (w1 | (w2 << 16)) : (w2 | (w1 << 16));

}

// 排序过程，重载“<”运算符

bool operator < (const INFO &f1, const INFO &f2)

{

	return (f1.nDist < f2.nDist || (f1.nDist == f2.nDist && f1.nIdx < f2.nIdx));

}

// 去重过程，重载“==”运算符

bool operator == (const INFO &f1, const INFO &f2)

{

	return (f1.nDist == f2.nDist && f1.nIdx == f2.nIdx);

}

int main(void)

{

	VECSTR profileStrs, titleStrs;

	for (std::string str; getline(std::cin, str) && str[0] != '#'; ) {

		// 读入数据，若以“P：”开头，则表示profile，若以“T：”开头，则表示title，若以空格或者tab开头，则承接上一个title。

		switch(str[0]) {

		case 'P':

			profileStrs.push_back(std::string(str.begin() + 2, str.end()));

			break;

		case 'T':

			titleStrs.push_back(std::string(str.begin() + 2, str.end()));

			break;

		case ' ':

		case '\t':

			titleStrs.back() += str;

			break;

		}

	}

	std::map<std::string, ushort> wordTbl;	     // 用于给每一个keywords编号，keywords与编号的映射关系存入wordTbl中

	std::vector<PROFILE> arrProfile;	         // 将每个profile中的keywords序列转化为相应的keywords编号序列

	for (VECSTR::iterator i = profileStrs.begin(); i != profileStrs.end(); ++i) {

		i->push_back(' ');

		std::string::iterator iBeg = i->begin();

		// 由于profile由阈值和keywords串组成，遍历profile字符串，找到阈值的起始位置

		for (; iBeg != i->end() && !isdigit(*iBeg); ++iBeg);

		// 找到阈值的结束位置，读取阈值

		std::string strThre;

		std::string::iterator iEnd = iBeg;

		for (; iEnd != i->end() && isdigit(*iEnd); ++iEnd)

			strThre.push_back(*iEnd);

		// 保存每一个profile的阈值和由keywords的编号组成的序列

		arrProfile.push_back(PROFILE());

		PROFILE &cur = arrProfile.back();

		// 将阈值由文本形式转为数值形式

		cur.nThreshold = atoi(strThre.c_str());

		//用于存储keywords中读取的单词

		std::string word;

		for (std::string::iterator j = iEnd; j != i->end(); ++j) {

			if (*j != ' ' && *j != '\t')

				word.push_back(*j);

			else if (!word.empty()) {

				// 更新keywords与编号的映射表

				ushort &wordIdx = wordTbl[word];

				if (wordIdx == 0)

					wordIdx = wordTbl.size();

				// 存储keywords编号序列

				cur.nArray.push_back(wordIdx);

				word.clear();

			}

		}

	}

	// 原输入为一个profile对应一组keywords pair，将其转变为一个keywords pair对应一个profile编号组，建立映射关系

	MAPINFO profileTbl;

	for (std::vector<PROFILE>::iterator i = arrProfile.begin(); i != arrProfile.end(); ++i)	{

		// 所有的keywords两两组合作为一个keywords pair

		for (ARRAY::iterator j = i->nArray.begin(); j != i->nArray.end() - 1; ++j) {

			for (ARRAY::iterator k = j + 1; k != i->nArray.end(); ++k) {

				INFO info = {i->nThreshold, i - arrProfile.begin()};

				profileTbl[MakeWordPair(*j, *k)].push_back(info);

			}

		}

	}

	MATRIX titleAry;

	for (VECSTR::iterator i = titleStrs.begin(); i != titleStrs.end(); ++i) {

		(*i)[i->size() - 1] = ' ';

		titleAry.push_back(ARRAY());

		std::string word;

		// 按题中要求处理title，去掉非字母的符号。再将title序列转化为编号序列，若某一个单词为keyword，则标记为相应的编号，若不是，则标记为-1

		for (std::string::iterator j = i->begin(); j != i->end(); ++j) {

			char cTmp = tolower(*j);

			if (cTmp != ' ' && cTmp != '\t') {

				if (isalpha(cTmp))

					word.push_back(cTmp);

			}

			else if (!word.empty()) {

				std::map<std::string, ushort>::iterator idx = wordTbl.find(word);

				titleAry.back().push_back(idx != wordTbl.end() ? idx->second : -1);

				word.clear();

			}

		}

	}

	// 每一个title中包含多个keywords pair，计算并存储每对keywords的距离

	MAPINFO titleTbl;

	for (MATRIX::iterator i = titleAry.begin(); i != titleAry.end(); ++i) {

		// 对当前title建立keywords pair，每对keywords的距离以及title编号的映射表

		std::map<ulong, ushort> curWordmap;

		for (ARRAY::iterator j = i->begin(); j != i->end() - 1; ++j) {

			if (*j != ushort(-1)) {

				for (ARRAY::iterator k = j + 1; k != i->end(); ++k) {

					if (*k != ushort(-1)) {

						// 若存在关键字对，则计算两个关键字间的距离，保留最小值

						ushort nDist = k - j;

						ushort &nWord = curWordmap[MakeWordPair(*j, *k)];

						if (nWord == 0 || nDist < nWord)

							nWord = nDist;

					}

				}

			}

		}

		// 将title处理为一个keywords pair对应一组title编号和距离

		for (std::map<ulong, ushort>::iterator j = curWordmap.begin(); j != curWordmap.end(); ++j) {

			INFO info = {j->second, i - titleAry.begin()};

			titleTbl[j->first].push_back(info);

		}

	}

	// 比较profile和title，确定哪些title属于相应的profile

	std::vector<PAIR> result;

	for (MAPINFO::iterator i = profileTbl.begin(); i != profileTbl.end(); ++i) {

		std::vector<INFO> &curP = i->second;

		std::vector<INFO> &curT = titleTbl[i->first];

		// 判断title中是否有该keywords pair

		if (!curT.empty()) {

			// 当profile和title包含相同的keywords时，将当前的profile编号排序去重

			std::sort(curP.begin(), curP.end());

			curP.erase(std::unique(curP.begin(), curP.end()), curP.end());

			std::sort(curT.begin(), curT.end());    // 将当前的title编号排序

			for (std::vector<INFO>::iterator icurP = curP.begin(), icurT = curT.begin();

				icurP != curP.end() && icurT != curT.end();) {

					// 若当前title中关键字的距离小于当前profile中阈值，则该title的编号必定属于当前之后的所有profile（包含当前profile）

					// 若大于当前阈值，则去下一个profile的阈值

				if (icurT->nDist - 1 <= icurP->nDist) {

					for (std::vector<INFO>::iterator j = icurP; j != curP.end(); ++j)

						result.push_back(std::make_pair(j->nIdx + 1, icurT->nIdx + 1));

					++icurT;

				}

				else

					++icurP;

			}

		}

		else

			result.push_back(std::make_pair(curP.front().nIdx + 1, 0));

	}

	// 对结果排序并输出

	std::sort(result.begin(), result.end());

	int nProfIdx = 0;

	for (std::vector<PAIR>::iterator i = result.begin(); i != result.end(); ++i) {

		if (i->first != nProfIdx) {

			nProfIdx = i->first;

			if (i != result.begin())

				std::cout << std::endl;

			std::cout << nProfIdx << ": ";

			if (i->second != 0)

				std::cout << i->second;

		}

		else if (i->second != 0)

				std::cout << ',' << i->second;

		}

	}

	std::cout << std::endl;

	return 0;

}