microservice basics

service to service call

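Services usually find each other's IP addresses through service discovery and then call each other directly. Feign together with Spring Cloud LoadBalancer can handle this. On Kubernetes (k8s) a dedicated service discovery framework such as Eureka is not needed: the Service (svc) name alone provides both service discovery and load balancing.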
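The "look up the instances, then call one of them directly" idea can be sketched in a few lines. This is only an illustration of client-side load balancing with a round-robin policy, not how Feign or Spring Cloud LoadBalancer are implemented; the `RoundRobinClient` class, the `registry` dictionary, and the addresses are all hypothetical.

```python
class RoundRobinClient:
    """Hypothetical client-side load balancer over a service registry."""

    def __init__(self, registry):
        # registry maps a service name to the addresses of its live instances
        self._registry = registry
        self._next_index = {}

    def choose(self, service_name):
        # Pick the next instance in turn, wrapping around at the end of the list
        instances = self._registry[service_name]
        index = self._next_index.get(service_name, 0)
        self._next_index[service_name] = (index + 1) % len(instances)
        return instances[index % len(instances)]


# Made-up registry content; a real one would be filled by Nacos, Eureka, etc.
registry = {"order-service": ["10.0.0.11:8080", "10.0.0.12:8080", "10.0.0.13:8080"]}
client = RoundRobinClient(registry)
chosen = [client.choose("order-service") for _ in range(4)]
print(chosen)
```

The caller only knows the service name; which instance receives the request is decided on the client side at call time.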

Choosing microservice components

https://cloud.tencent.com/developer/article/1888723

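Registry: Nacos; alternatives: Eureka, Consul, ZooKeeper
Configuration center: Nacos; alternatives: Spring Cloud Config, Consul
Service calls: Feign; alternative: RestTemplate
Circuit breaking: Sentinel; alternatives: Resilience4j, Hystrix
Circuit breaker monitoring: Sentinel Dashboard
Load balancing: Spring Cloud LoadBalancer
Gateway: Spring Cloud Gateway
Tracing: Spring Cloud Sleuth + Zipkin; alternatives: SkyWalking, etc.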

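Nacos and Sentinel are both Alibaba components. If you would rather not use Alibaba components, the alternatives listed above can be used instead.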

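The Spring Cloud website introduces Spring Cloud Alibaba and Spring Cloud Netflix alongside individual components such as Spring Cloud Config. Spring Cloud Alibaba and Spring Cloud Config are not really the same kind of thing: Spring Cloud Alibaba bundles many components, but the set may not be complete, so it can still be combined with the standalone Spring Cloud components, such as Spring Cloud OpenFeign and Spring Cloud Gateway.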

BOM dependency vs parent dependency in Maven

As the title says.

https://stackoverflow.com/questions/61364842/bom-dependency-vs-parent-dependancy-in-maven

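Most of what they do is the same, with a few differences:

  • A project can import several BOMs but can have only one parent
  • A parent passes on everything in the parent POM, whereas a BOM should only pass on dependency management
  • In a multi-module project, a parent also lets you reference sibling modules directly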
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <!-- Parent: only one is allowed, and everything in it is inherited -->
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.6.0</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>
    <groupId>com.example</groupId>
    <artifactId>demo</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>demo</name>
    <description>Demo project for Spring Boot</description>
    <properties>
        <java.version>11</java.version>
        <spring-cloud.version>2021.0.0-RC1</spring-cloud.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.cloud</groupId>
            <artifactId>spring-cloud-starter-config</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
    </dependencies>
    <!-- BOM: imported with type pom and scope import; several can be imported -->
    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>org.springframework.cloud</groupId>
                <artifactId>spring-cloud-dependencies</artifactId>
                <version>${spring-cloud.version}</version>
                <type>pom</type>
                <scope>import</scope>
            </dependency>
        </dependencies>
    </dependencyManagement>

    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
        </plugins>
    </build>
    <repositories>
        <repository>
            <id>spring-milestones</id>
            <name>Spring Milestones</name>
            <url>https://repo.spring.io/milestone</url>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </repository>
    </repositories>

</project>

machine learning basics

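Variance

In statistics, the variance is the mean of the squared differences between each sample value and the mean of all sample values. Dividing by the number of samples n gives the population variance; the unbiased sample variance divides by n - 1 instead.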
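A small worked example may make the definition concrete. The sample values below are made up; the sketch computes the variance by hand and compares it with Python's standard `statistics` module, where `pvariance` divides by n and `variance` divides by n - 1.

```python
import statistics

samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up data

mean = sum(samples) / len(samples)
squared_deviations = [(x - mean) ** 2 for x in samples]

# Divide by n: mean of the squared deviations (population variance)
population_variance = sum(squared_deviations) / len(samples)

# Divide by n - 1: unbiased sample variance
sample_variance = sum(squared_deviations) / (len(samples) - 1)

print(population_variance, statistics.pvariance(samples))
print(sample_variance, statistics.variance(samples))
```

For this data the mean is 5 and the squared deviations sum to 32, so the population variance is 32 / 8 = 4 and the sample variance is 32 / 7.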

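Random variable

Simply put, a random variable is the numerical expression of a random event. Examples: the number of animals that die within a given time after being injected with some toxin; the measured hemoglobin level of each of several healthy adult men in some area; and so on.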

scikit-learn (fit, predict)

https://towardsdatascience.com/fit-vs-predict-vs-fit-predict-in-python-scikit-learn-f15a34a8d39f

fit() is implemented by every estimator and it accepts an input for the sample data (X) and for supervised models it also accepts an argument for labels (i.e. target data y ). Optionally, it can also accept additional sample properties such as weights etc.
Here fit means fitting the model to the data, i.e. the training process.

fit:
This is good terminology to use in machine learning, because supervised machine learning algorithms seek to approximate the unknown underlying mapping function for the output variables given the input variables.
Statistics often describe the goodness of fit which refers to measures used to estimate how well the approximation of the function matches the target function.

predict() will perform a prediction for each test instance and it usually accepts only a single input (X)
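The fit/predict convention can be illustrated without scikit-learn itself. The class below is a hypothetical one-feature least-squares regressor written only to mirror the shape of the API described above (`fit(X, y)` learns from samples and labels, `predict(X)` takes samples only); the class name and the data are assumptions, and a real project would use an actual scikit-learn estimator.

```python
class SimpleLinearRegressor:
    """Hypothetical one-feature least-squares model with a fit/predict interface."""

    def fit(self, X, y):
        # X: rows with a single feature, e.g. [[1.0], [2.0]]; y: one label per row
        xs = [row[0] for row in X]
        n = len(xs)
        x_mean = sum(xs) / n
        y_mean = sum(y) / n
        sxx = sum((x - x_mean) ** 2 for x in xs)
        sxy = sum((x - x_mean) * (t - y_mean) for x, t in zip(xs, y))
        # Learned parameters: slope and intercept of the fitted line
        self.coef_ = sxy / sxx
        self.intercept_ = y_mean - self.coef_ * x_mean
        return self

    def predict(self, X):
        # Prediction needs only the features, never the labels
        return [self.coef_ * row[0] + self.intercept_ for row in X]


X_train = [[1.0], [2.0], [3.0], [4.0]]
y_train = [3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1

model = SimpleLinearRegressor().fit(X_train, y_train)
predictions = model.predict([[5.0], [6.0]])
print(model.coef_, model.intercept_, predictions)
```

Training happens entirely inside `fit`; afterwards `predict` just applies the learned slope and intercept to new inputs.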

feature vs label

https://stackoverflow.com/questions/40898019/what-is-the-difference-between-a-feature-and-a-label/40899529
Briefly, feature is input; label is output. This applies to both classification and regression problems.

A feature is one column of the data in your input set. For instance, if you’re trying to predict the type of pet someone will choose, your input features might include age, home region, family income, etc. The label is the final choice, such as dog, fish, iguana, rock, etc.

Once you’ve trained your model, you will give it sets of new input containing those features; it will return the predicted “label” (pet type) for that person.

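For a linear regression model that predicts house prices, the features are things like floor, layout, and building age, and the label is the final house price.
The fit function of sklearn's regression estimators is described as follows:
fit() is implemented by every estimator and it accepts an input for the sample data (X) and for supervised models it also accepts an argument for labels (i.e. target data y )
So the target data is simply the label.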

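Loss function

A loss function measures how far the model's predicted values (output in our example) are from the true values (y_train in our example). It is a non-negative real-valued function; the smaller the loss, the better the model fits the data. Training a model means iterating, typically with an optimization algorithm such as gradient descent, so that the loss keeps shrinking. Reaching the smallest loss means the algorithm is optimal in the sense that the loss defines.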

Text Vectorization

https://towardsdatascience.com/getting-started-with-text-vectorization-2f2efbec6685
https://monkeylearn.com/blog/what-is-tf-idf/

Text vectorization is the process of converting text into a numerical representation. Here are some popular methods to accomplish text vectorization:
Binary Term Frequency
Bag of Words (BoW) Term Frequency
(L1) Normalized Term Frequency
(L2) Normalized TF-IDF
Word2Vec

Machine learning with natural language is faced with one major hurdle – its algorithms usually deal with numbers, and natural language is, well, text. So we need to transform that text into numbers, otherwise known as text vectorization. It’s a fundamental step in the process of machine learning for analyzing data, and different vectorization algorithms will drastically affect end results, so you need to choose one that will deliver the results you’re hoping for.
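To make the first few methods in the list concrete, here is a minimal sketch on three made-up sentences. It assumes whitespace tokenization and the plain TF-IDF formula tf * log(N / df); libraries commonly use smoothed or normalized variants, so the numbers would not match theirs exactly.

```python
import math
from collections import Counter

docs = ["the cat sat", "the dog sat", "the dog barked"]  # made-up corpus
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({token for doc in tokenized for token in doc})

# Number of documents in which each term appears
doc_freq = Counter(token for doc in tokenized for token in set(doc))


def binary_tf(tokens):
    # 1 if the term occurs in the document, else 0
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]


def bag_of_words(tokens):
    # Raw count of each vocabulary term in the document
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def tf_idf(tokens):
    # Term frequency scaled by log(N / df); terms found everywhere score 0
    counts = Counter(tokens)
    n_docs = len(tokenized)
    return [
        counts[term] / len(tokens) * math.log(n_docs / doc_freq[term])
        for term in vocabulary
    ]


bow_vectors = [bag_of_words(doc) for doc in tokenized]
tfidf_vectors = [tf_idf(doc) for doc in tokenized]
print(vocabulary)
print(bow_vectors)
print(tfidf_vectors)
```

In this corpus "the" appears in every document, so its TF-IDF weight is zero, while a rarer word such as "cat" gets a higher weight than "sat".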

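Gradient descent and gradient ascent

In machine learning, when minimizing a loss function, gradient descent can be used to find the minimum of the loss and the corresponding parameter values. Conversely, when an objective function has to be maximized, gradient ascent can be used.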
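The two ideas differ only in the sign of the update step. The sketch below uses two made-up one-dimensional functions with known optima; the learning rate and step count are arbitrary assumptions chosen so that the iteration converges.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=200):
    # Move against the gradient to reduce the function value
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x


def gradient_ascent(grad, start, learning_rate=0.1, steps=200):
    # Move along the gradient to increase the function value
    x = start
    for _ in range(steps):
        x += learning_rate * grad(x)
    return x


# loss(x) = (x - 3)^2 has gradient 2(x - 3) and its minimum at x = 3
x_min = gradient_descent(lambda x: 2 * (x - 3), start=0.0)

# objective(x) = -(x + 1)^2 has gradient -2(x + 1) and its maximum at x = -1
x_max = gradient_ascent(lambda x: -2 * (x + 1), start=5.0)

print(x_min, x_max)
```

Maximizing a function with gradient ascent is equivalent to minimizing its negative with gradient descent.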

Definitions of Train, Validation, and Test Datasets

https://machinelearningmastery.com/difference-test-validation-datasets/

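Training set: the student's textbook; the student masters the material from what is in the textbook.

Validation set: homework; homework shows how different students are doing and how quickly they are improving.

Test set: the exam; its questions have never been seen before and check whether the student can generalize.

So why is a test set needed?

a) The training set directly drives the fitting of the model's parameters, so it clearly cannot reflect the model's true ability (this stops a student who memorizes the textbook from getting the best grade, i.e. it guards against overfitting).

b) The validation set takes part in manual tuning (of hyperparameters), so it cannot be the final judge of a model either (a student who just drills the question bank does not count as a good student).

c) So the final exam (the test set) is what measures the true ability of a student (model).

Validation data has labels. Test data also needs labels so that the predictions can be scored; the difference is that the test set is never used for training or tuning and is only used at the end to see how well the model does.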
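A three-way split can be sketched as follows. The function name, the 60/20/20 proportions, and the seed are assumptions for illustration; the data is a made-up list of (feature, label) pairs, and every split keeps its labels.

```python
import random


def train_val_test_split(samples, val_fraction=0.2, test_fraction=0.2, seed=42):
    # Shuffle a copy so the splits are random but reproducible
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)

    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)

    test = shuffled[:n_test]                # held out until the very end
    val = shuffled[n_test:n_test + n_val]   # used for hyperparameter tuning
    train = shuffled[n_test + n_val:]       # used to fit model parameters
    return train, val, test


dataset = [(x, 2 * x + 1) for x in range(10)]  # (feature, label) pairs
train, val, test = train_val_test_split(dataset)
print(len(train), len(val), len(test))
```

The three parts are disjoint and together cover the whole data set, so no exam question has appeared in the textbook or the homework.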

What is Noise in Machine Learning

https://deepchecks.com/glossary/noise-in-machine-learning/

Humans are prone to making mistakes when collecting data, and data collection instruments may be unreliable, resulting in dataset errors. The errors are referred to as noise. Data noise in machine learning can cause problems since the algorithm interprets the noise as a pattern and can start generalizing from it.

keras vs tf.keras

https://www.pyimagesearch.com/2019/10/21/keras-vs-tf-keras-whats-the-difference-in-tensorflow-2-0/

false positive

https://www.pico.net/kb/what-is-a-false-positive-rate/
False Positive
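A False Positive is the incorrect identification of normal data as anomalous, i.e. classifying as “abnormal” data which is in fact normal.
The data is actually fine, but the prediction says there is a problem. For example, in our email sensitive-information prediction, we tell the user that an email contains sensitive information when it actually does not. We need to reduce FPs.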
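Counting false positives is easy to show on a toy example. The labels below are made up, with 1 meaning "contains sensitive information" and 0 meaning "clean"; the false positive rate is computed as FP / (FP + TN).

```python
def confusion_counts(y_true, y_pred):
    # Count true/false positives and negatives for binary labels (1 = positive)
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1  # flagged as positive although it is actually negative
        elif not predicted and actual:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


y_true = [1, 0, 0, 1, 0, 0, 1, 0]  # made-up ground truth
y_pred = [1, 1, 0, 1, 0, 1, 0, 0]  # made-up model output

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
false_positive_rate = fp / (fp + tn)
print(tp, fp, tn, fn, false_positive_rate)
```

Here two of the five clean emails are wrongly flagged, giving a false positive rate of 0.4.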

Regression evaluation metrics: MSE, RMSE, MAE, R-Squared

https://www.jianshu.com/p/9ee85fdad150
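The four metrics can be computed by hand from their standard definitions. The function name and the true and predicted values below are made up for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mse = sum(e * e for e in errors) / n   # mean squared error
    rmse = math.sqrt(mse)                  # root mean squared error
    mae = sum(abs(e) for e in errors) / n  # mean absolute error

    # R-squared: 1 - residual sum of squares / total sum of squares
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot

    return {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2}


y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

MSE, RMSE, and MAE are errors (lower is better), while R-Squared is at most 1 and higher is better.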

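Underfitting and overfitting

https://zhuanlan.zhihu.com/p/72038532
For a deep learning or machine learning model, we want not only a good fit on the training data set (low training error) but also a good fit on unseen data (the test set), which is its generalization ability; the resulting test error is called the generalization error. When gauging generalization ability, the most visible symptoms are overfitting and underfitting.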

hyperparameter vs parameter

https://www.geeksforgeeks.org/difference-between-model-parameters-vs-hyperparameters/
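Put simply, parameters are what the model adjusts automatically during training, for example the coefficients in front of x in linear regression.
Hyperparameters are the settings we give the model, for example the arguments of sklearn estimators, such as n_clusters of sklearn.cluster.KMeans.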

k8s pv

A PersistentVolume (PV) is a piece of storage in the cluster that has been provisioned by an administrator or dynamically provisioned using Storage Classes. It is a resource in the cluster just like a node is a cluster resource. PVs are volume plugins like Volumes, but have a lifecycle independent of any individual Pod that uses the PV. This API object captures the details of the implementation of the storage, be that NFS, iSCSI, or a cloud-provider-specific storage system.
With dynamic provisioning, a PV is created through a StorageClass (sc).

A PersistentVolumeClaim (PVC) is a request for storage by a user. It is similar to a Pod. Pods consume node resources and PVCs consume PV resources. Pods can request specific levels of resources (CPU and Memory). Claims can request specific size and access modes (e.g., they can be mounted ReadWriteOnce, ReadOnlyMany or ReadWriteMany, see AccessModes).
A PV's resources are consumed by PVCs.

spring jpa

spring jpa tips

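  • The JPA EntityManager is not thread-safe, but it can still be injected into a singleton DAO because what gets injected is actually a proxy, and the proxy hands each thread its own EntityManager

  • Spring Boot sets spring.jpa.open-in-view=true by default

  • Spring Data repositories are transactional by default, whereas the EntityManager is not; this can be seen by debugging TransactionInterceptor.java

  • Spring Data repositories also operate on entities through the EntityManager

  • By default, @Transactional rolls back only on unchecked exceptions (RuntimeException and Error), not on checked exceptions

  • Spring's Java config has advantages over XML config: it is easy to control in code which database dev/uat/prod use, easy to specify the location of the H2 database, and easy to look up a bean's configuration parameters in the source code instead of searching the documentation as XML requires. XML and Java config can of course be used together

  • H2 is a very handy database that lets other people deploy your project quickly

  • Using join fetch is one option for solving the Hibernate N+1 problem

  • In JPA, avoid one-to-many, many-to-many, and similar relationships between entity objects where possible. Keep entities as independent as you can and run JPQL or SQL queries and updates when related data is needed. This avoids implicit operations triggered by cascading relationships, makes the code's intent more transparent, and makes later extension or targeted optimization easier.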

code example

https://gitprod.company.com/employeeid/spring/tree/master/spring-mvc-jpa-h2

kubernetes basics

ingress

In Kubernetes, an Ingress is an object that allows access to your Kubernetes services from outside the Kubernetes cluster. You configure access by creating a collection of rules that define which inbound connections reach which services.

https://stackoverflow.com/questions/56896490/what-exactly-kubernetes-services-are-and-how-they-are-different-from-deployments

cluster context user

Cluster defines connection endpoint for Kubernetes API of a cluster.

User defines credentials for connecting to cluster.

Context defines both cluster and user.

https://stackoverflow.com/questions/56299440/kubectl-context-vs-cluster

volumes

Volume decouples the storage from the Container. Its lifecycle is coupled to a pod. It enables safe container restarts and sharing data between containers in a pod.

Persistent Volume decouples the storage from the Pod. Its lifecycle is independent. It enables safe pod restarts and sharing data between pods.

https://stackoverflow.com/questions/51420621/what-is-the-difference-between-a-volume-and-persistent-volume

ip

Each Pod has a single IP address assigned from the Pod CIDR range of its node. This IP address is shared by all containers running within the Pod, and connects them to other Pods running in the cluster.

Each Service has an IP address, called the ClusterIP, assigned from the cluster’s VPC network. You can optionally customize the VPC network when you create the cluster.

Each Pod gets its own IP address, however in a Deployment, the set of Pods running in one moment in time could be different from the set of Pods running that application a moment later.

This leads to a problem: if some set of Pods (call them “backends”) provides functionality to other Pods (call them “frontends”) inside your cluster, how do the frontends find out and keep track of which IP address to connect to, so that the frontend can use the backend part of the workload? Enter Services.

exec container or pod

https://kubernetes.io/docs/tasks/debug-application-cluster/get-shell-running-container/

headless service

The working principle of a ClusterIP Service is as follows:

One Service may be backed by multiple endpoints (pods). A client accesses the cluster IP address and the request is forwarded to the real server based on the iptables rules to implement load balancing. For example, a Service is backed by two endpoints, but only the Service address is returned during DNS query. iptables determines the real server accessed by the client. However, when a headless Service is accessed, two actual endpoint records (pod IP addresses) are returned.

Therefore, the headless Service points directly to each endpoint, that is, each pod has a DNS domain name. In this way, pods can access each other, achieving inter-pod discovery and access.