[0.0]Analysis of Baidu search engine
Recently, my two teammates and I have been working on a project: a simplified Chinese search engine for children in primary school. We call it "kidsearch".
Since our project will be based on the Baidu search engine, I'd like to start with a simple analysis of it.
First, Baidu is not entirely suitable for children. Baidu, as a commercial company, provides the public with a free search service. Naturally, not everything shown on the search engine is what people need; some results appear for commercial reasons or because of other factors. This may not have much impact on adults, who can tell good content from bad, but the impact is obvious when it comes to children. For example, we can search for these keywords on Baidu: "波" ("wave"; notice the pictures), "交换群" ("exchange group"; notice the results), and "医院" ("hospital"; notice the advertisements). These are fairly ordinary words, not to mention the results for worse keywords. Such search results are not just inappropriate; some of them are even harmful. This situation has to be fixed, and fixing it is the purpose of our project "kidsearch".
Actually, searching the Internet is simpler for children than for adults, so the problem is also simplified. We can just use Baidu as a tool (this is no exaggeration), rearrange the results, fix the improper or useless entries, and add some content suitable for children. After these fixes, the search engine will serve children much better.
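The three steps above (rearrange, fix, add) can be sketched roughly as follows. This is only a minimal sketch, not our final design: the result fields (`title`, `url`), the `BLOCKED_TERMS`, `PREFERRED_TERMS` and `CURATED` tables, and the function name `kid_friendly` are all hypothetical, and fetching the raw results from Baidu is left out.

```python
# Hypothetical sketch: post-process a list of raw search results for children.
# The field names and the tables below are assumptions for illustration only.

BLOCKED_TERMS = {"广告", "成人"}            # placeholder blocklist terms
PREFERRED_TERMS = {"百科", "儿童", "小学"}   # placeholder hints for kid-friendly pages
CURATED = {
    # query -> extra entries picked by hand for children
    "恐龙": [{"title": "恐龙 (儿童百科)", "url": "https://example.com/kids/dinosaur"}],
}


def kid_friendly(query, results):
    """Drop unsuitable entries, move preferred ones up, and prepend curated ones."""
    # 1. Fix: remove entries whose title contains a blocked term.
    kept = [r for r in results
            if not any(term in r["title"] for term in BLOCKED_TERMS)]
    # 2. Rearrange: entries with a preferred term come first. sorted() is stable,
    #    so the original order is preserved inside each group.
    kept = sorted(kept,
                  key=lambda r: not any(t in r["title"] for t in PREFERRED_TERMS))
    # 3. Add: put hand-picked content for this query at the top.
    return CURATED.get(query, []) + kept


raw = [
    {"title": "恐龙玩具 广告", "url": "https://example.com/ad"},
    {"title": "恐龙的种类", "url": "https://example.com/types"},
    {"title": "恐龙 百科", "url": "https://example.com/wiki"},
]
for r in kid_friendly("恐龙", raw):
    print(r["title"])
```

Matching on titles alone is of course too crude for real use; the point is only that each step is a small, separate transformation on the result list.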
So, what content is appropriate for children?
Based on the thoughts above, I summarized what children may need. (The list probably does not cover everything yet, and we will refine it in the future.)
1. Notions -- encyclopedia
2. Material -- pictures, music, videos
3. Entertainment -- games
4. Study -- homework, knowledge
Moreover, there are some kinds of content that children don't need:
1. Advertisements
2. Adult (mature) content
3. Sexual or homosexual content
4. Sidebars (mostly ads, adult content, or content useless for children)
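The exclusion list above could be expressed as a single predicate over parsed page entries. The sketch below is only an illustration under assumptions: the `kind` and `text` fields, the `UNWANTED_KINDS` and `UNWANTED_TERMS` values, and the function names are hypothetical and do not reflect Baidu's actual page format.

```python
# Hypothetical sketch of the exclusion rules as one predicate.
# The "kind" and "text" fields and the term list are assumptions.

UNWANTED_KINDS = {"ad", "sidebar"}   # whole page regions to drop (items 1 and 4)
UNWANTED_TERMS = ("成人", "色情")     # placeholder terms for unsuitable content


def is_unwanted(entry):
    """Return True if the entry is rejected by one of the exclusion rules."""
    if entry.get("kind") in UNWANTED_KINDS:
        return True
    text = entry.get("text", "")
    return any(term in text for term in UNWANTED_TERMS)


def clean(entries):
    """Keep only the entries that none of the rules reject."""
    return [e for e in entries if not is_unwanted(e)]


page = [
    {"kind": "ad", "text": "某医院推广"},
    {"kind": "result", "text": "小学数学作业辅导"},
    {"kind": "sidebar", "text": "相关搜索"},
    {"kind": "result", "text": "成人内容"},
]
print(clean(page))
```

A plain term list will miss many cases and wrongly reject others, so in practice it would probably need to be combined with word segmentation or a curated whitelist.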
Now that we know what children need, the next step is to tackle these items one by one.
What technology will we use?
After trying several options, such as PHP, Java, and Python, I decided to use Python because it is really convenient for crawling. Although building web pages is a bit more difficult with it than with PHP, that doesn't matter too much.
Besides, there is a huge number of third-party libraries available for Python, such as requests, flask, django, and jieba. I have tried all of them in a preliminary way.
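As a small illustration of how little code the crawling side needs, the query URL can be built with the standard library alone. This sketch assumes Baidu's public search endpoint `https://www.baidu.com/s` with the keyword passed in the `wd` parameter; the function name `build_search_url` is hypothetical, and the actual download (for example with requests) is left out.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BAIDU_SEARCH = "https://www.baidu.com/s"  # assumed search endpoint


def build_search_url(keyword):
    """Build a search URL for the keyword; Chinese text is percent-encoded as UTF-8."""
    return BAIDU_SEARCH + "?" + urlencode({"wd": keyword})


url = build_search_url("恐龙")
print(url)
# Round-trip check: decoding the query string gives back the original keyword.
print(parse_qs(urlsplit(url).query)["wd"][0])
```

The resulting URL could then be passed to a library such as requests, and the returned page parsed before the filtering and rearranging steps are applied.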
More details will be given later. Our aim is to create a search engine that children can use and like to use.