UVa OJ 175韦德娱乐1946手机版 – Keywords (关键字)

Time limit: 3.000 seconds
限时3.000秒

 

Problem 问题

Many researchers are faced with an ever increasing number of journal
articles to read and find it difficult to locate papers of relevance to
their particular lines of research. However, it is possible to subscribe
to various services which claim that they will find articles that fit an
`interest profile’ that you supply, and pass them on to you. One simple
way of performing such a search is to determine whether a pair of
keywords occurs `sufficiently’ close to each other in the title of an
article. The threshold is determined by the researchers themselves, and
refers to the number of words that may occur between the pair of
keywords. Thus an archeologist interested in cave paintings could
specify her profile as “0 rock art”, meaning that she wants all
titles in which the words “rock” and “art” appear with 0
words in between, that is next to each other. This would select not only
“Rock Art of the
Maori” but also “Pop Art, Rock, and the Art of
Hang-glider Maintenance”.
无数研究人士都面临那样一个难点:阅读的杂志文章数量一日千里,要找到与她们一定研商方向相关的稿子困难重重。然则,有一部分订阅服务声称它们得以按您制定的“兴趣配置”找到匹配的篇章,并传递给您。一种简易的点子就是进行那样一种检索:分明小说中是还是不是有一对单词出现的“丰裕”
靠近。切磋人士设定三个阈值,提出一对单词之间应出现的单词数量。例如一个考古学家对岩洞壁画感兴趣,就会钦赐他的趣味配置为“0
rock
art”,意思是她希望标题中冒出“rock”和“art”且距离为0单词的全部作品,即那七个单词相互相临。那样的趣味配置会选出的标题包涵“RockArt of the Maori”和“Pop Art, 罗克, and the Art of Hang-glider
Maintenance”等。

Write a program that will read in a series of profiles followed by a
series of titles and determine which of the titles (if any) are selected
by each of the profiles. A title is selected by a profile if at least
one pair of keywords from the profile is found in the title, separated
by no more than the given threshold. For the purposes of this program, a
word is a sequence of letters, preceded by one or more blanks and
terminated by a blank or the end of line marker.
写二个程序,读入一多元的安顿文件,再读入一密密麻麻的标题,鲜明怎么着标题(要是有)会被各安顿选中。3个题名被二个配置选中仅当配置中的至少一对单词出现在标题中,并且距离没有抢先给定的阈值。对于那么些程序而言,三个单词正是字母的队列,前边有1个或五个空白,并以空白或行终止符作为实现。

 

Input 输入

Input will consist of no more than 50 profiles followed by no more than
250 titles. Each profile and title will be numbered in the order of
their appearance, starting from 1, although the numbers will not appear
in the file.
输入的安插不会超越四二十一个,标题不会超过2伍17个。每一个配备和标题都是提交的逐条编号(从1起先计数),但编号并不会在输入中提交。

 

Each profile will start with the characters “P:”, and will consist
of a number representing a threshold, followed by two or more keywords
in lower case.
各类配置都是字符“P:”开头,包含三个意味着阈值的数,接下去是八个或更加多的根本字,均为小写格局。

 

Each title will start with the characters “T:”, and will consist of
a string of characters terminated by “|”. The character “|” will
not occur anywhere in a title except at the end. No title will be longer
than 255 characters, and if necessary it will flow on to more than one
line. No line will be longer than eighty characters and each
continuation line of a title will start with at least one blank. Line
breaks will only occur between words.
每一种题目都是字符“T:”起始,会席卷四个以“|”作为实现的字符串。字符“|”不会师世在标题中除末尾外的任何职责。标题都不会超过2五拾贰个字节,假设供给会分成多行提交。全数行的尺寸都不会超过77个字符,且题指标各样续行都以至少三个空荡荡作为开首。换行只会并发在单词之间。

 

All non-alphabetic characters are to be ignored, thus the title
“Don’t Rock — the Boat as Metaphor in 1984” would be treated as
“Dont Rock the Boat as Metaphor in” and “HP2100X” will be
treated as “HPX”. The file will be terminated by a line consisting
of a single #.
全体非字母的字符都应忽视,例如标题“Don’t
Rock — the Boat as Metaphor in 1985”应被看做“Dont 罗克 the Boat as
Metaphor
in”处理,“HP2100X”将被用作“HPX”处理。输入文件以唯有一个#的一行作为实现。

 

Output 输出

Output will consist of a series of lines, one for each profile in the
input. Each line will consist of the profile number (the number of its
appearance in the input) followed by “:”, a blank space, and the
numbers of the selected titles in numerical order, separated by commas
and with no spaces.
输出由多行构成,每行对应输入的八个配置。每行都应包含布署的号码(配置在输入中的编号)跟着一个“:”,3个空格,然后是以数字顺序排列的入选题指标数码,中间以逗号隔开,不要空格。

 

Sample input 示例输入

 

Sample output 示例输出

 

Analysis 分析

那道题重点观测对寻找匹配难点的建立模型能力,实际和字符串处理关系非常小。要留意以下几点:

  1. 享有非字母的字符都不处理;
  2. 仅以空格或换行作为单词的相间符;
  3. 单词均以小写形式处理;
  4. 安插中的单词任两个都要算做一些。

前三条标准实际上就把单词给量化了,假设对单词编号建表,那么配置和标题就都成为了一堆数字(每个数字皆为单词的号子)

壹 、以正确的格局处理输入的配置,录入全部配备中的单词。遍历配置中的全数单词,建立从单词到数码的对应表(即单词表),此处能够动用stl中的map作为单词表的数据结构。接下来用单词表规格化处理全部的布局和标题,标题中不在单词表中的单词可用-1标志。

二 、建立数量配置中的单词对数码。对于配置中的每一对单词,实际上是贰个长富组:(单词对,阈值,所属配置编号)。由于在第2步已经将装有单词规格化为数字了,因而单词对正是七个整数。考虑单词的总额一定不会上万,且单词对中七个单词的顺序无所谓,由此得以用七个字节表示3个号码,然后将较小的号码放在高字节,较大的放低位构成叁个4字节的平头,那个平头就能够唯一的象征2个单词对。那么具有配置中的单词对的数目就能够两种情势来公布了,那里运用map映射,key是单词对的平头,value是四个结构体的动态数组,结构体中包涵阈值和所属配置编号。

三 、建立标题中的单词对数码。标题中的单词归纳非关键词(编号为-1)和主要性词,必要出各标题中每一对存在的首要词的最短距离,并用一种数据结构表达。那里运用和第三步相似的数据结构,每一对首要词是八个三元组:(单词对,最短距离,所属标题编号)。查找标题中具备重点词对的最短距离用暴力搜索就足以了,遍历的逐一和冒泡排序一样,复杂度是n2。全数数值求出来后,建立map映射,key是单词对的平头,value是1个结构体的动态数组,结构体中包罗最短距离和所属标题的号码。

④ 、最终就是比对配置的映射表和题指标映射表,找出同样的单词对,然后比对各自的数组。若是有最短距离小于或等于阈值的,那就在那么些标题编号和配置编号建立贰个挂钩。找出富有的配备编号-标题编号关系后,按布置编号排序,整理输出即可。

 

Solution 解答

 

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
#include <map>
#include <utility>

typedef unsigned long ulong;
typedef unsigned short ushort;

// 用于存储profile中的阈值和转成数字序列的关键词组合
struct PROFILE
{
    size_t nThreshold;
    std::vector<ushort> nArray;
};

// 用于存储profile中的阈值和profile的编号,title中的包含的两个关键字之间的距离和title的编号
struct INFO
{
    size_t nDist;
    size_t nIdx;
};

typedef std::vector<std::string> VECSTR;
typedef std::vector<ushort> ARRAY;
typedef std::vector<ARRAY> MATRIX;
typedef std::map<ulong, std::vector<INFO> > MAPINFO;
typedef std::pair<size_t, size_t> PAIR;

// 将keywords对中的两个单词用数字序列表示,用一个unsigned short数据类型存储
ulong MakeWordPair(ushort w1, ushort w2)
{
    return (w1 > w2)? (w1 | (w2 << 16)) : (w2 | (w1 << 16));
}

// 排序过程,重载“<”运算符
bool operator < (const INFO &f1, const INFO &f2)
{
    return (f1.nDist < f2.nDist || (f1.nDist == f2.nDist && f1.nIdx < f2.nIdx));
}

// 去重过程,重载“==”运算符
bool operator == (const INFO &f1, const INFO &f2)
{
    return (f1.nDist == f2.nDist && f1.nIdx == f2.nIdx);
}

int main(void)
{
    VECSTR profileStrs, titleStrs;
    for (std::string str; getline(std::cin, str) && str[0] != '#'; ) {
        // 读入数据,若以“P:”开头,则表示profile,若以“T:”开头,则表示title,若以空格或者tab开头,则承接上一个title。
        switch(str[0]) {
        case 'P':
            profileStrs.push_back(std::string(str.begin() + 2, str.end()));
            break;
        case 'T':
            titleStrs.push_back(std::string(str.begin() + 2, str.end()));
            break;
        case ' ':
        case '\t':
            titleStrs.back() += str;
            break;
        }
    }
    std::map<std::string, ushort> wordTbl;         // 用于给每一个keywords编号,keywords与编号的映射关系存入wordTbl中
    std::vector<PROFILE> arrProfile;           // 将每个profile中的keywords序列转化为相应的keywords编号序列
    for (VECSTR::iterator i = profileStrs.begin(); i != profileStrs.end(); ++i) {
        i->push_back(' ');
        std::string::iterator iBeg = i->begin();
        // 由于profile由阈值和keywords串组成,遍历profile字符串,找到阈值的起始位置
        for (; iBeg != i->end() && !isdigit(*iBeg); ++iBeg);
        // 找到阈值的结束位置,读取阈值
        std::string strThre;
        std::string::iterator iEnd = iBeg;
        for (; iEnd != i->end() && isdigit(*iEnd); ++iEnd)
            strThre.push_back(*iEnd);
        // 保存每一个profile的阈值和由keywords的编号组成的序列
        arrProfile.push_back(PROFILE());
        PROFILE &cur = arrProfile.back();
        // 将阈值由文本形式转为数值形式
        cur.nThreshold = atoi(strThre.c_str()); 
        //用于存储keywords中读取的单词
        std::string word;   
        for (std::string::iterator j = iEnd; j != i->end(); ++j) {
            if (*j != ' ' && *j != '\t')
                word.push_back(*j);
            else if (!word.empty()) {
                // 更新keywords与编号的映射表
                ushort &wordIdx = wordTbl[word];
                if (wordIdx == 0)
                    wordIdx = wordTbl.size();
                // 存储keywords编号序列
                cur.nArray.push_back(wordIdx); 
                word.clear();
            }
        }
    }
    // 原输入为一个profile对应一组keywords pair,将其转变为一个keywords pair对应一个profile编号组,建立映射关系
    MAPINFO profileTbl;
    for (std::vector<PROFILE>::iterator i = arrProfile.begin(); i != arrProfile.end(); ++i)   {
        // 所有的keywords两两组合作为一个keywords pair
        for (ARRAY::iterator j = i->nArray.begin(); j != i->nArray.end() - 1; ++j) {
            for (ARRAY::iterator k = j + 1; k != i->nArray.end(); ++k) {
                INFO info = {i->nThreshold, i - arrProfile.begin()};
                profileTbl[MakeWordPair(*j, *k)].push_back(info);
            }
        }
    }

    MATRIX titleAry;
    for (VECSTR::iterator i = titleStrs.begin(); i != titleStrs.end(); ++i) {
        (*i)[i->size() - 1] = ' ';
        titleAry.push_back(ARRAY());
        std::string word;
        // 按题中要求处理title,去掉非字母的符号。再将title序列转化为编号序列,若某一个单词为keyword,则标记为相应的编号,若不是,则标记为-1
        for (std::string::iterator j = i->begin(); j != i->end(); ++j) {
            char cTmp = tolower(*j);
            if (cTmp != ' ' && cTmp != '\t') {
                if (isalpha(cTmp))
                    word.push_back(cTmp);
            }
            else if (!word.empty()) {
                std::map<std::string, ushort>::iterator idx = wordTbl.find(word);
                titleAry.back().push_back(idx != wordTbl.end() ? idx->second : -1);
                word.clear();
            }
        }
    }
    // 每一个title中包含多个keywords pair,计算并存储每对keywords的距离
    MAPINFO titleTbl;
    for (MATRIX::iterator i = titleAry.begin(); i != titleAry.end(); ++i) {
        // 对当前title建立keywords pair,每对keywords的距离以及title编号的映射表
        std::map<ulong, ushort> curWordmap;
        for (ARRAY::iterator j = i->begin(); j != i->end() - 1; ++j) {
            if (*j != ushort(-1)) {
                for (ARRAY::iterator k = j + 1; k != i->end(); ++k) {
                    if (*k != ushort(-1)) {
                        // 若存在关键字对,则计算两个关键字间的距离,保留最小值
                        ushort nDist = k - j;
                        ushort &nWord = curWordmap[MakeWordPair(*j, *k)];
                        if (nWord == 0 || nDist < nWord)
                            nWord = nDist;
                    }
                }
            }
        }
        // 将title处理为一个keywords pair对应一组title编号和距离
        for (std::map<ulong, ushort>::iterator j = curWordmap.begin(); j != curWordmap.end(); ++j) {
            INFO info = {j->second, i - titleAry.begin()};
            titleTbl[j->first].push_back(info);
        }
    }
    // 比较profile和title,确定哪些title属于相应的profile
    std::vector<PAIR> result;
    for (MAPINFO::iterator i = profileTbl.begin(); i != profileTbl.end(); ++i) {
        std::vector<INFO> &curP = i->second;
        std::vector<INFO> &curT = titleTbl[i->first];
        // 判断title中是否有该keywords pair
        if (!curT.empty()) {
            // 当profile和title包含相同的keywords时,将当前的profile编号排序去重
            std::sort(curP.begin(), curP.end());
            curP.erase(std::unique(curP.begin(), curP.end()), curP.end());
            std::sort(curT.begin(), curT.end());    // 将当前的title编号排序
            for (std::vector<INFO>::iterator icurP = curP.begin(), icurT = curT.begin(); 
                icurP != curP.end() && icurT != curT.end();) {
                    // 若当前title中关键字的距离小于当前profile中阈值,则该title的编号必定属于当前之后的所有profile(包含当前profile)
                    // 若大于当前阈值,则去下一个profile的阈值
                if (icurT->nDist - 1 <= icurP->nDist) {
                    for (std::vector<INFO>::iterator j = icurP; j != curP.end(); ++j)
                        result.push_back(std::make_pair(j->nIdx + 1, icurT->nIdx + 1));
                    ++icurT;
                }
                else
                    ++icurP;
            }
        }
        else
            result.push_back(std::make_pair(curP.front().nIdx + 1, 0));
    }
    // 对结果排序并输出
    std::sort(result.begin(), result.end());
    int nProfIdx = 0;
    for (std::vector<PAIR>::iterator i = result.begin(); i != result.end(); ++i) {
        if (i->first != nProfIdx) {
            nProfIdx = i->first;
            if (i != result.begin())
                std::cout << std::endl;
            std::cout << nProfIdx << ": ";
            if (i->second != 0)
                std::cout << i->second;
        }
        else if (i->second != 0)
                std::cout << ',' << i->second;
        }
    }
    std::cout << std::endl;
    return 0;
}

 

发表评论

电子邮件地址不会被公开。 必填项已用*标注

网站地图xml地图